Add Deflaking tests information to Developer Guide

This updates flaky-tests.md with all of the information on finding and
deflaking tests from the presentation to SIG Testing found here:
https://www.youtube.com/watch?v=Ewp8LNY_qTg

Also, this drops the outdated "Hunting flaky unit tests" section from
flaky-tests.md.

Co-authored-by: Aaron Crickenberger <spiffxp@google.com>

This commit is contained in:
parent c2fa487203
commit 15de14640b

flaky-tests.md

@@ -1,4 +1,4 @@
# Flaky Tests

Any test that fails occasionally is "flaky". Since our merges only proceed when
all tests are green, and we have a number of different CI systems running the

@@ -10,7 +10,27 @@ writing our tests defensively. When flakes are identified, we should prioritize
addressing them, either by fixing them or quarantining them off the critical
path.

For more information about deflaking Kubernetes tests, watch @liggitt's
[presentation from Kubernetes SIG Testing - 2020-08-25](https://www.youtube.com/watch?v=Ewp8LNY_qTg).

**Table of Contents**

- [Flaky Tests](#flaky-tests)
  - [Avoiding Flakes](#avoiding-flakes)
  - [Quarantining Flakes](#quarantining-flakes)
  - [Hunting Flakes](#hunting-flakes)
  - [GitHub Issues for Known Flakes](#github-issues-for-known-flakes)
    - [Expectations when a flaky test is assigned to you](#expectations-when-a-flaky-test-is-assigned-to-you)
    - [Writing a good flake report](#writing-a-good-flake-report)
  - [Deflaking unit tests](#deflaking-unit-tests)
  - [Deflaking integration tests](#deflaking-integration-tests)
  - [Deflaking e2e tests](#deflaking-e2e-tests)
    - [Gathering information](#gathering-information)
    - [Filtering and correlating information](#filtering-and-correlating-information)
    - [What to look for](#what-to-look-for)

## Avoiding Flakes

Write tests defensively. Remember that "almost never" happens all the time when
tests are run thousands of times in a CI environment. Tests need to be tolerant

@@ -45,7 +65,7 @@ Don't assume things will succeed after a fixed delay, but don't wait forever.
    - "expected 3 widgets, found 2, will retry"
    - "expected pod to be in state foo, currently in state bar, will retry"
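
For example, here is a minimal sketch of this kind of tolerant, informative
polling in a Go test, using `wait.PollImmediate` from
`k8s.io/apimachinery/pkg/util/wait`; the package and the `getWidgets` helper
are illustrative stand-ins for whatever the real test queries:

```go
package widgets

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// getWidgets stands in for a real lookup (e.g. a client list call); here it
// simply reports a third widget once a couple of seconds have passed.
var readyAt = time.Now().Add(2 * time.Second)

func getWidgets() ([]string, error) {
	if time.Now().Before(readyAt) {
		return []string{"a", "b"}, nil
	}
	return []string{"a", "b", "c"}, nil
}

func TestWidgetsEventuallyReady(t *testing.T) {
	// Poll instead of sleeping for a fixed delay, and log progress on each
	// retry so a flake leaves a useful trail in the test output.
	err := wait.PollImmediate(500*time.Millisecond, wait.ForeverTestTimeout, func() (bool, error) {
		widgets, err := getWidgets()
		if err != nil {
			return false, err // hard error: stop polling
		}
		if len(widgets) != 3 {
			t.Logf("expected 3 widgets, found %d, will retry", len(widgets))
			return false, nil // not ready yet: keep polling
		}
		return true, nil
	})
	if err != nil {
		t.Fatalf("timed out waiting for widgets: %v", err)
	}
}
```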

## Quarantining Flakes

- When quarantining a presubmit test, ensure an issue exists in the current
  release milestone assigned to the owning SIG. The issue should be labeled

@@ -63,7 +83,7 @@ Don't assume things will succeed after a fixed delay, but don't wait forever.
  feature. The majority of release-blocking and merge-blocking suites avoid
  these jobs unless they're proven to be non-flaky.

## Hunting Flakes

We offer the following tools to aid in finding or troubleshooting flakes

@@ -82,7 +102,7 @@ We offer the following tools to aid in finding or troubleshooting flakes
[go.k8s.io/triage]: https://go.k8s.io/triage
[testgrid.k8s.io]: https://testgrid.k8s.io

## GitHub Issues for Known Flakes

Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.

@@ -105,7 +125,7 @@ flakes is a quick way to gain expertise and community goodwill.
[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake

### Expectations when a flaky test is assigned to you

Note that we won't randomly assign these issues to you unless you've opted in or
you're part of a group that has opted in. We are more than happy to accept help

@@ -140,113 +160,237 @@ release-blocking flakes. Therefore we have the following guidelines:
   name, which will result in the test being quarantined to only those jobs that
   explicitly run flakes (eg: https://testgrid.k8s.io/google-gce#gci-gce-flaky)

### Writing a good flake report

If you are reporting a flake, it is important to include enough information for
others to reproduce the issue. When filing the issue, use the
[flaking test template](https://github.com/kubernetes/kubernetes/issues/new?labels=kind%2Fflake&template=flaking-test.md).
In your issue, answer the following questions:

- Is this flaking in multiple jobs? You can search for the flaking test or error
  messages using the
  [Kubernetes Aggregated Test Results](http://go.k8s.io/triage) tool.
- Are there multiple tests in the same package or suite failing with the same
  apparent error?

In addition, be sure to include the following information:

- A link to [testgrid](https://testgrid.k8s.io/) history for the flaking test's
  jobs, filtered to the relevant tests
- The failed test output; this is essential because it makes the issue searchable
- A link to the triage query
- A link to specific failures
- Be sure to tag the relevant SIG, if you know what it is.

For a good example of a flaking test issue,
[check here](https://github.com/kubernetes/kubernetes/issues/93358).

([TODO](https://github.com/kubernetes/kubernetes/issues/95528): Move these
instructions to the issue template.)

## Deflaking unit tests

To get started with deflaking unit tests, you will need to first reproduce the
flaky behavior. Start with a simple attempt to just run the flaky unit test.
For example:

```sh
go test ./pkg/kubelet/config -run TestInvalidPodFiltered
```

Also make sure that you bypass the `go test` cache by using an uncacheable
command-line option:

```sh
go test ./pkg/kubelet/config -count=1 -run TestInvalidPodFiltered
```

If even this does not reveal issues with the flaky test, try running with
[race detection](https://golang.org/doc/articles/race_detector.html) enabled:

```sh
go test ./pkg/kubelet/config -race -count=1 -run TestInvalidPodFiltered
```

Finally, you can stress test the unit test using the
[stress command](https://godoc.org/golang.org/x/tools/cmd/stress). Install it
with this command:

```sh
go get golang.org/x/tools/cmd/stress
```

Then build your test binary:

```sh
go test ./pkg/kubelet/config -race -c
```

Then run it under stress:

```sh
stress ./config.test -test.run TestInvalidPodFiltered
```

The stress command runs the test binary repeatedly, reporting when it fails. It
will periodically report how many times it has run and how many failures have
occurred. You should see output like this:

```
411 runs so far, 0 failures
/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20200825T115041-341977266
--- FAIL: TestInvalidPodFiltered (0.00s)
    config_test.go:126: Expected no update in channel, Got types.PodUpdate{Pods:[]*v1.Pod{(*v1.Pod)(0xc00059e400)}, Op:1, Source:"test"}
FAIL
ERROR: exit status 1
815 runs so far, 1 failures
```

Be careful with tests that use the `net/http/httptest` package; they could
exhaust the available ports on your system!

## Deflaking integration tests

Integration tests run similarly to unit tests, but they almost always expect a
running `etcd` instance. You should already have `etcd` installed if you have
followed the instructions in the [Development Guide](../development.md). Run
`etcd` in another shell window or tab.

Compile your integration test using a command like this:

```sh
go test -c -race ./test/integration/endpointslice
```

And then stress test the flaky test using the `stress` command:

```sh
stress ./endpointslice.test -test.run TestEndpointSliceMirroring
```

For an example of a failing or flaky integration test,
[read this issue](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-678375312).

Sometimes, but not often, a test will fail due to timeouts caused by deadlocks.
A deadlock may surface as a timeout when you stress test an entire package; to
track down the offending test, stress test the tests in that package
individually. This process can take extra effort. Try following these steps:

1. Run each test in the package individually to figure out the average runtime.
2. Stress each test individually, bounding the timeout to 100 times the average
   run time.
3. Isolate the particular test that is deadlocking.
4. Add debug output to figure out what is causing the deadlock (see the sketch
   below).
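
For step 4, one way to get debug output is to have the test dump every
goroutine's stack when it times out, making the blocked goroutines visible.
Here is a minimal sketch using the standard `runtime/pprof` package; the
package name, `runSuspectCode`, and the 30-second bound (roughly 100 times an
imagined average run time) are illustrative:

```go
package deadlock

import (
	"os"
	"runtime/pprof"
	"testing"
	"time"
)

// runSuspectCode stands in for the code path that appears to deadlock;
// this version deliberately blocks forever.
func runSuspectCode() {
	block := make(chan struct{})
	<-block
}

func TestSuspectedDeadlock(t *testing.T) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		runSuspectCode()
	}()

	select {
	case <-done:
		// Finished normally.
	case <-time.After(30 * time.Second):
		// Dump all goroutine stacks so the blocked ones can be inspected.
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		t.Fatal("timed out; possible deadlock, see goroutine dump above")
	}
}
```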

Hopefully this can help narrow down exactly where the deadlock is occurring,
revealing a simple fix!

## Deflaking e2e tests

A flaky [end-to-end (e2e) test](e2e-tests.md) offers its own set of
challenges. In particular, these tests are difficult because they test the
entire Kubernetes system. This can be both good and bad. It can be good because
we want the entire system to work when testing, but an e2e test can also fail
because of something completely unrelated, such as failing infrastructure or
misconfigured volumes. Be aware that you can't simply look at the title of an
e2e test to understand exactly what is being tested. If possible, look for unit
and integration tests related to the problem you are trying to solve.

### Gathering information

The first step in deflaking an e2e test is to gather information. We capture a
lot of information from e2e test runs, and you can use these artifacts to
gather information as to why a test is failing.

Use the [Prow Status](https://prow.k8s.io/) tool to collect information on
specific test jobs. Drill down into a job and use the **Artifacts** tab to
collect information. For example, with
[this particular test job](https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1296558932902285312),
we can collect the following:

- `build-log.txt`
- In the control plane directory, `artifacts/e2e-171671cb3f-674b9-master/`:
  - `kube-apiserver-audit.log` (and rotated files)
  - `kube-apiserver.log`
  - `kube-controller-manager.log`
  - `kube-scheduler.log`
  - And more!

The `artifacts/` directory will contain much more information. From inside the
directories for each node:

- `e2e-171671cb3f-674b9-minion-group-drkr`
- `e2e-171671cb3f-674b9-minion-group-lr2z`
- `e2e-171671cb3f-674b9-minion-group-qkkz`

Look for these files:

- `kubelet.log`
- `docker.log`
- `kube-proxy.log`
- And so forth.

### Filtering and correlating information

Once you have gathered your information, the next step is to filter and
correlate it. This can require some familiarity with the issue you are tracking
down, but look first at the relevant components, such as the test log and the
logs for the API server, controller manager, and `kubelet`.

Filter the logs to find events that happened around the time of the failure and
events that occurred in related namespaces and objects.

The goal is to collate log entries from all of these different files so you can
get a picture of what was happening in the distributed system. This will help
you figure out exactly where the e2e test is failing. One tool that may help
you with this is
[k8s-e2e-log-combiner](https://github.com/brianpursley/k8s-e2e-log-combiner).

Kubernetes has a lot of nested systems, so sometimes log entries can refer to
events happening three levels deep. This means that line numbers in logs might
not refer to where problems and messages originate. Do not make any assumptions
about where messages are initiated!

If you have trouble finding relevant logging information or events, don't be
afraid to add debugging output to the test. For an example of this approach,
[see this issue](https://github.com/kubernetes/kubernetes/pull/88297#issuecomment-588607417).

### What to look for

One of the first things to look for is whether the test assumes that something
runs synchronously when it actually runs asynchronously. For example, if the
test is kicking off a goroutine, you might need to add delays to simulate slow
operations and reproduce issues.

Examples of the types of changes you could make to try to force a failure (see
the sketch after this list):

- `time.Sleep(time.Second)` at the top of a goroutine
- `time.Sleep(time.Second)` at the beginning of a watch event handler
- `time.Sleep(time.Second)` at the end of a watch event handler
- `time.Sleep(time.Second)` at the beginning of a sync loop worker
- `time.Sleep(time.Second)` at the end of a sync loop worker
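
As a concrete illustration, here is a self-contained sketch of the first item:
a delay injected at the top of a worker goroutine (a temporary change while
deflaking, not something to commit) that exposes a hidden ordering assumption.
All names are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	updates := make(chan string, 1)

	// The worker goroutine under test.
	go func() {
		time.Sleep(time.Second) // injected delay: simulate a slow worker
		updates <- "synced"
	}()

	// A step that assumed the worker had already run now fails reliably,
	// reproducing the flake on every run.
	select {
	case u := <-updates:
		fmt.Println("got update:", u)
	case <-time.After(100 * time.Millisecond):
		fmt.Println("flake reproduced: no update received yet")
	}
}
```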

Sometimes,
[such as in this example](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-675631856),
a test might be causing a race condition with the system it is trying to test.
Investigate whether the test is conflicting with an asynchronous background
process. To verify the issue, simulate the test losing the race by putting a
`time.Sleep(time.Second)` between test steps.

If a test assumes that an operation will happen quickly, it might not be taking
into account the configuration of a CI environment. A CI environment will
generally be more resource-constrained and will run multiple tests in parallel.
A test that runs in less than a second locally could take a few seconds in a CI
environment.

Unless your test is specifically testing performance or timing, don't set tight
timing tolerances. Use `wait.ForeverTestTimeout`, which is a reasonable
stand-in for operations that should not take very long. This is a better
approach than polling for 1 to 10 seconds.
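
A sketch of that pattern, waiting on a channel with `wait.ForeverTestTimeout`
(30 seconds, from `k8s.io/apimachinery/pkg/util/wait`) instead of a hand-tuned
one-second timeout; the worker goroutine is an illustrative stand-in for the
system under test:

```go
package worker

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func TestWorkerPublishesUpdate(t *testing.T) {
	updates := make(chan string, 1)
	go func() { updates <- "synced" }() // stand-in for the system under test

	select {
	case u := <-updates:
		if u != "synced" {
			t.Fatalf("unexpected update: %q", u)
		}
	case <-time.After(wait.ForeverTestTimeout):
		// Generous enough for a loaded CI machine, but the test still
		// terminates if the update never arrives.
		t.Fatal("timed out waiting for update")
	}
}
```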

Is the test incorrectly assuming deterministic output? Remember that map
iteration in Go is non-deterministic. If a list is compiled or a set of steps
is performed by iterating over a map, the results will not arrive in a
predictable order. Make sure the test can tolerate any iteration order.
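
For example, a minimal sketch of an order-independent assertion: sort whatever
you collect from the map before comparing it:

```go
package widgets

import (
	"reflect"
	"sort"
	"testing"
)

func TestWidgetNames(t *testing.T) {
	widgets := map[string]int{"foo": 1, "bar": 2, "baz": 3}

	var names []string
	for name := range widgets { // iteration order is randomized in Go
		names = append(names, name)
	}

	sort.Strings(names) // compare in a canonical order, not iteration order
	if want := []string{"bar", "baz", "foo"}; !reflect.DeepEqual(names, want) {
		t.Fatalf("got %v, want %v", names, want)
	}
}
```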

Be aware that if a test mixes random allocation with static allocation, there
will be intermittent conflicts.

Finally, if you are using a fake client with a watcher, it can relist/rewatch
at any point. It is better to look for specific actions in the fake client than
to assert the exact contents of the full action set.
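
For instance, with the fake clientset from `k8s.io/client-go/kubernetes/fake`,
here is a sketch of scanning the recorded actions for the one you care about,
rather than asserting on the exact action list (the deployment name is
illustrative):

```go
package controller

import (
	"context"
	"testing"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestCreatesDeployment(t *testing.T) {
	client := fake.NewSimpleClientset()

	_, err := client.AppsV1().Deployments("default").Create(
		context.TODO(),
		&appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{Name: "demo"}},
		metav1.CreateOptions{},
	)
	if err != nil {
		t.Fatal(err)
	}

	// Scan for the specific action instead of asserting the exact,
	// complete list, which may gain extra list/watch actions.
	found := false
	for _, action := range client.Actions() {
		if action.GetVerb() == "create" && action.GetResource().Resource == "deployments" {
			found = true
		}
	}
	if !found {
		t.Fatal("expected a create action for deployments")
	}
}
```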