Add Deflaking tests information to Developer Guide
This updates flaky-tests.md with all of the information on finding and deflaking tests from the presentation to SIG Testing found here: https://www.youtube.com/watch?v=Ewp8LNY_qTg

Also, this drops the outdated "Hunting flaky unit tests" section from flaky-tests.md.

Co-authored-by: Aaron Crickenberger <spiffxp@google.com>
This commit is contained in:
parent c2fa487203
commit 15de14640b
# Flaky Tests
Any test that fails occasionally is "flaky". Since our merges only proceed when
all tests are green, and we have a number of different CI systems running the

writing our tests defensively. When flakes are identified, we should prioritize
addressing them, either by fixing them or quarantining them off the critical
path.

For more information about deflaking Kubernetes tests, watch @liggitt's
[presentation from Kubernetes SIG Testing - 2020-08-25](https://www.youtube.com/watch?v=Ewp8LNY_qTg).

**Table of Contents**

- [Flaky Tests](#flaky-tests)
  - [Avoiding Flakes](#avoiding-flakes)
  - [Quarantining Flakes](#quarantining-flakes)
  - [Hunting Flakes](#hunting-flakes)
  - [GitHub Issues for Known Flakes](#github-issues-for-known-flakes)
    - [Expectations when a flaky test is assigned to you](#expectations-when-a-flaky-test-is-assigned-to-you)
    - [Writing a good flake report](#writing-a-good-flake-report)
  - [Deflaking unit tests](#deflaking-unit-tests)
  - [Deflaking integration tests](#deflaking-integration-tests)
  - [Deflaking e2e tests](#deflaking-e2e-tests)
    - [Gathering information](#gathering-information)
    - [Filtering and correlating information](#filtering-and-correlating-information)
    - [What to look for](#what-to-look-for)

## Avoiding Flakes

Write tests defensively. Remember that "almost never" happens all the time when
tests are run thousands of times in a CI environment. Tests need to be tolerant

Don't assume things will succeed after a fixed delay, but don't wait forever.

- "expected 3 widgets, found 2, will retry"
- "expected pod to be in state foo, currently in state bar, will retry"

## Quarantining Flakes

- When quarantining a presubmit test, ensure an issue exists in the current
  release milestone assigned to the owning SIG. The issue should be labeled
  feature. The majority of release-blocking and merge-blocking suites avoid
  these jobs unless they're proven to be non-flaky.

## Hunting Flakes

We offer the following tools to aid in finding or troubleshooting flakes

[go.k8s.io/triage]: https://go.k8s.io/triage
[testgrid.k8s.io]: https://testgrid.k8s.io

## GitHub Issues for Known Flakes

Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.

flakes is a quick way to gain expertise and community goodwill.

[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake

### Expectations when a flaky test is assigned to you

Note that we won't randomly assign these issues to you unless you've opted in or
you're part of a group that has opted in. We are more than happy to accept help

release-blocking flakes. Therefore we have the following guidelines:

name, which will result in the test being quarantined to only those jobs that
explicitly run flakes (eg: https://testgrid.k8s.io/google-gce#gci-gce-flaky)

### Writing a good flake report

If you are reporting a flake, it is important to include enough information for
others to reproduce the issue. When filing the issue, use the
[flaking test template](https://github.com/kubernetes/kubernetes/issues/new?labels=kind%2Fflake&template=flaking-test.md). In
your issue, answer the following questions:

- Is this flaking in multiple jobs? You can search for the flaking test or error
  messages using the
  [Kubernetes Aggregated Test Results](http://go.k8s.io/triage) tool.
- Are there multiple tests in the same package or suite failing with the same apparent error?

In addition, be sure to include the following information:

- A link to [testgrid](https://testgrid.k8s.io/) history for the flaking test's
  jobs, filtered to the relevant tests
- The failed test output — this is essential because it makes the issue searchable
- A link to the triage query
- A link to specific failures
- Be sure to tag the relevant SIG, if you know what it is.

For a good example of a flaking test issue,
[check here](https://github.com/kubernetes/kubernetes/issues/93358).

([TODO](https://github.com/kubernetes/kubernetes/issues/95528): Move these instructions to the issue template.)

## Deflaking unit tests

To get started with deflaking unit tests, you will need to first
reproduce the flaky behavior. Start with a simple attempt to just run
the flaky unit test. For example:

```sh
go test ./pkg/kubelet/config -run TestInvalidPodFiltered
```

Also make sure that you bypass the `go test` cache by using an uncachable
command line option:

```sh
go test ./pkg/kubelet/config -count=1 -run TestInvalidPodFiltered
```

If even this is not revealing issues with the flaky test, try running with
[race detection](https://golang.org/doc/articles/race_detector.html) enabled:

```sh
go test ./pkg/kubelet/config -race -count=1 -run TestInvalidPodFiltered
```

Finally, you can stress test the unit test using the
[stress command](https://godoc.org/golang.org/x/tools/cmd/stress). Install it
with this command:

```sh
go get golang.org/x/tools/cmd/stress
```

Then build your test binary:

```sh
go test ./pkg/kubelet/config -race -c
```

Then run it under stress:

```sh
stress ./config.test -test.run TestInvalidPodFiltered
```

The stress command runs the test binary repeatedly, reporting when it fails. It
will periodically report how many times it has run and how many failures have
occurred.

You should see output like this:

```
411 runs so far, 0 failures
/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20200825T115041-341977266
--- FAIL: TestInvalidPodFiltered (0.00s)
    config_test.go:126: Expected no update in channel, Got types.PodUpdate{Pods:[]*v1.Pod{(*v1.Pod)(0xc00059e400)}, Op:1, Source:"test"}
FAIL
ERROR: exit status 1
815 runs so far, 1 failures
```

Be careful with tests that use the `net/http/httptest` package; they could
exhaust the available ports on your system!

## Deflaking integration tests

Integration tests run similarly to unit tests, but they almost always expect a
running `etcd` instance. You should already have `etcd` installed if you have
followed the instructions in the [Development Guide](../development.md). Run
`etcd` in another shell window or tab.

Compile your integration test using a command like this:

```sh
go test -c -race ./test/integration/endpointslice
```

And then stress test the flaky test using the `stress` command:

```sh
stress ./endpointslice.test -test.run TestEndpointSliceMirroring
```

For an example of a failing or flaky integration test,
[read this issue](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-678375312).

Sometimes, but not often, a test will fail due to timeouts caused by
deadlocks. This can show up when stress testing an entire package. The way to
track it down is to stress test the individual tests in that package, which can
take extra effort. Try following these steps:

1. Run each test in the package individually to figure out the average runtime.
2. Stress each test individually, bounding the timeout to 100 times the average run time.
3. Isolate the particular test that is deadlocking.
4. Add debug output to figure out what is causing the deadlock (one approach is sketched below).

Hopefully this can help narrow down exactly where the deadlock is occurring,
revealing a simple fix!
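For step 4, one possible way to get that debug output (just a sketch; the 30-second bound and the test name are placeholders, and this is not something the stress tool provides for you) is a temporary watchdog that dumps all goroutine stacks once the test exceeds the bound you measured in step 2:

```go
package example

import (
	"os"
	"runtime/pprof"
	"testing"
	"time"
)

func TestSuspectedDeadlock(t *testing.T) {
	// Temporary debugging aid: if this test runs far past the bound chosen in
	// step 2, dump every goroutine's stack so the blocked call sites show up
	// in the stress failure output. Remove once the deadlock is found.
	watchdog := time.AfterFunc(30*time.Second, func() {
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
	})
	defer watchdog.Stop()

	// ... body of the suspect test ...
}
```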

## Deflaking e2e tests

A flaky [end-to-end (e2e) test](e2e-tests.md) offers its own set of
challenges. In particular, these tests are difficult because they test the
entire Kubernetes system. This can be both good and bad. It can be good because
we want the entire system to work when testing, but an e2e test can also fail
because of something completely unrelated, such as failing infrastructure or
misconfigured volumes. Be aware that you can't simply look at the title of an
e2e test to understand exactly what is being tested. If possible, look for unit
and integration tests related to the problem you are trying to solve.

### Gathering information

The first step in deflaking an e2e test is to gather information. We capture a
lot of information from e2e test runs, and you can use these artifacts to gather
information as to why a test is failing.

Use the [Prow Status](https://prow.k8s.io/) tool to collect information on
specific test jobs. Drill down into a job and use the **Artifacts** tab to
collect information. For example, with
[this particular test job](https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1296558932902285312),
we can collect the following:

* `build-log.txt`
* In the control plane directory: `artifacts/e2e-171671cb3f-674b9-master/`
  * `kube-apiserver-audit.log` (and rotated files)
  * `kube-apiserver.log`
  * `kube-controller-manager.log`
  * `kube-scheduler.log`
  * And more!

The `artifacts/` directory will contain much more information. From inside the
directories for each node:

- `e2e-171671cb3f-674b9-minion-group-drkr`
- `e2e-171671cb3f-674b9-minion-group-lr2z`
- `e2e-171671cb3f-674b9-minion-group-qkkz`

Look for these files:

* `kubelet.log`
* `docker.log`
* `kube-proxy.log`
* And so forth.

### Filtering and correlating information

Once you have gathered your information, the next step is to filter and
correlate it. This can require some familiarity with the issue you are tracking
down, but look first at the relevant components, such as the test log and the
logs for the API server, controller manager, and `kubelet`.

Filter the logs to find events that happened around the time of the failure and
events that occurred in related namespaces and objects.

The goal is to collate log entries from all of these different files so you can
get a picture of what was happening in the distributed system. This will help
you figure out exactly where the e2e test is failing. One tool that may help you
with this is [k8s-e2e-log-combiner](https://github.com/brianpursley/k8s-e2e-log-combiner).

Kubernetes has a lot of nested systems, so sometimes log entries can refer to
events happening three levels deep. This means that line numbers in logs might
not refer to where problems and messages originate. Do not make any assumptions
about where messages are initiated!

If you have trouble finding relevant logging information or events, don't be
afraid to add debugging output to the test. For an example of this approach,
[see this issue](https://github.com/kubernetes/kubernetes/pull/88297#issuecomment-588607417).

### What to look for

One of the first things to look for is whether the test assumes that something
runs synchronously when it actually runs asynchronously. For example, if the
test is kicking off a goroutine, you might need to add delays to simulate slow
operations and reproduce issues.

Examples of the types of changes you could make to try to force a failure
(a sketch of the first one follows this list):

- `time.Sleep(time.Second)` at the top of a goroutine
- `time.Sleep(time.Second)` at the beginning of a watch event handler
- `time.Sleep(time.Second)` at the end of a watch event handler
- `time.Sleep(time.Second)` at the beginning of a sync loop worker
- `time.Sleep(time.Second)` at the end of a sync loop worker
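As a sketch of the first bullet, applied as a temporary local change to hypothetical code under test (the `controller` type and `processUpdates` method are invented for the example):

```go
package example

import "time"

type controller struct{}

func (c *controller) processUpdates() {
	go func() {
		// Temporary, local-only change while deflaking: force this goroutine
		// to lose any race with the test body. Remove before committing.
		time.Sleep(time.Second)

		// ... original goroutine body continues here ...
	}()
}
```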

Sometimes,
[such as in this example](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-675631856),
a test might be causing a race condition with the system it is trying to
test. Investigate whether the test is conflicting with an asynchronous background
process. To verify the issue, simulate the test losing the race by putting a
`time.Sleep(time.Second)` between test steps.

If a test assumes that an operation will happen quickly, it might not be taking
into account the configuration of a CI environment. A CI environment will
generally be more resource-constrained and will run multiple tests in
parallel. A test that runs in less than a second locally could take a few
seconds in a CI environment.

Unless your test is specifically testing performance or timing, don't set tight
timing tolerances. Use `wait.ForeverTestTimeout`, which is a reasonable stand-in
for operations that should not take very long. This is a better approach than
polling for 1 to 10 seconds.
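A minimal sketch of that advice, assuming the test is waiting on a result channel (`resultCh` and `waitForResult` are stand-ins for whatever the real test waits on):

```go
package example

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func waitForResult(t *testing.T, resultCh <-chan string) string {
	t.Helper()
	select {
	case got := <-resultCh:
		return got
	case <-time.After(wait.ForeverTestTimeout):
		// ForeverTestTimeout (30s) is generous enough for a loaded CI node,
		// but still fails the test instead of hanging forever.
		t.Fatal("timed out waiting for result")
		return ""
	}
}
```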

Is the test incorrectly assuming deterministic output? Remember that map
iteration in Go is non-deterministic. If a list is being compiled, or a set of
steps is being performed, by iterating over a map, they will not be completed in
a predictable order. Make sure the test is able to tolerate any order in a map.
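One way to make such an assertion order-independent, sketched with invented names, is to canonicalize whatever was collected from the map before comparing:

```go
package example

import (
	"reflect"
	"sort"
	"testing"
)

func TestCollectedNamesTolerateMapOrder(t *testing.T) {
	items := map[string]int{"b": 2, "a": 1, "c": 3} // iteration order is random

	var got []string
	for name := range items {
		got = append(got, name)
	}

	// Sort the collected keys so the comparison does not depend on map
	// iteration order.
	sort.Strings(got)
	want := []string{"a", "b", "c"}
	if !reflect.DeepEqual(got, want) {
		t.Errorf("got %v, want %v", got, want)
	}
}
```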

Be aware that if a test mixes random allocation with static allocation, there
will be intermittent conflicts.

Finally, if you are using a fake client with a watcher, it can relist/rewatch at
any point. It is better to look for specific actions in the fake client rather
than asserting exact content of the full set.
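A sketch of that approach with the `client-go` fake clientset (the pod and namespace names are invented): scan the recorded actions for the one you care about rather than asserting on the exact, complete list, whose length and ordering can change when the fake client relists or rewatches.

```go
package example

import (
	"context"
	"testing"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestPodCreateIsRecorded(t *testing.T) {
	client := fake.NewSimpleClientset()

	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "example-pod", Namespace: "default"}}
	if _, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		t.Fatal(err)
	}

	// Look for the specific action we expect instead of asserting on the
	// exact, complete list of recorded actions.
	found := false
	for _, action := range client.Actions() {
		if action.Matches("create", "pods") {
			found = true
			break
		}
	}
	if !found {
		t.Errorf("expected a create action for pods, got %v", client.Actions())
	}
}
```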