Add Deflaking tests information to Developer Guide

This updates flaky-tests.md with all of the information on finding and deflaking tests from the presentation to SIG Testing found here: https://www.youtube.com/watch?v=Ewp8LNY_qTg
Also, this drops the outdated "Hunting flaky unit tests" section from flaky-tests.md.

Co-authored-by: Aaron Crickenberger <spiffxp@google.com>
Erik L. Arneson 2020-10-08 15:23:56 -07:00
parent c2fa487203
commit 15de14640b
1 changed file with 238 additions and 94 deletions


@@ -1,4 +1,4 @@
# Flaky Tests
Any test that fails occasionally is "flaky". Since our merges only proceed when
all tests are green, and we have a number of different CI systems running the
@@ -10,7 +10,27 @@ writing our tests defensively. When flakes are identified, we should prioritize
addressing them, either by fixing them or quarantining them off the critical
path.
For more information about deflaking Kubernetes tests, watch @liggitt's
[presentation from Kubernetes SIG Testing - 2020-08-25](https://www.youtube.com/watch?v=Ewp8LNY_qTg).
**Table of Contents**

- [Flaky Tests](#flaky-tests)
  - [Avoiding Flakes](#avoiding-flakes)
  - [Quarantining Flakes](#quarantining-flakes)
  - [Hunting Flakes](#hunting-flakes)
  - [GitHub Issues for Known Flakes](#github-issues-for-known-flakes)
    - [Expectations when a flaky test is assigned to you](#expectations-when-a-flaky-test-is-assigned-to-you)
    - [Writing a good flake report](#writing-a-good-flake-report)
  - [Deflaking unit tests](#deflaking-unit-tests)
  - [Deflaking integration tests](#deflaking-integration-tests)
  - [Deflaking e2e tests](#deflaking-e2e-tests)
    - [Gathering information](#gathering-information)
    - [Filtering and correlating information](#filtering-and-correlating-information)
    - [What to look for](#what-to-look-for)
## Avoiding Flakes
Write tests defensively. Remember that "almost never" happens all the time when
tests are run thousands of times in a CI environment. Tests need to be tolerant
@@ -45,7 +65,7 @@ Don't assume things will succeed after a fixed delay, but don't wait forever.
- "expected 3 widgets, found 2, will retry"
- "expected pod to be in state foo, currently in state bar, will retry"
## Quarantining Flakes
- When quarantining a presubmit test, ensure an issue exists in the current
release milestone assigned to the owning SIG. The issue should be labeled
@@ -63,7 +83,7 @@ Don't assume things will succeed after a fixed delay, but don't wait forever.
feature. The majority of release-blocking and merge-blocking suites avoid
these jobs unless they're proven to be non-flaky.
## Hunting Flakes
We offer the following tools to aid in finding or troubleshooting flakes
@@ -82,7 +102,7 @@ We offer the following tools to aid in finding or troubleshooting flakes
[go.k8s.io/triage]: https://go.k8s.io/triage
[testgrid.k8s.io]: https://testgrid.k8s.io
## GitHub Issues for Known Flakes
Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.
@@ -105,7 +125,7 @@ flakes is a quick way to gain expertise and community goodwill.
[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake
### Expectations when a flaky test is assigned to you
Note that we won't randomly assign these issues to you unless you've opted in or
you're part of a group that has opted in. We are more than happy to accept help
@@ -140,113 +160,237 @@ release-blocking flakes. Therefore we have the following guidelines:
name, which will result in the test being quarantined to only those jobs that
explicitly run flakes (eg: https://testgrid.k8s.io/google-gce#gci-gce-flaky)
### Writing a good flake report
If you are reporting a flake, it is important to include enough information for
others to reproduce the issue. When filing the issue, use the
[flaking test template](https://github.com/kubernetes/kubernetes/issues/new?labels=kind%2Fflake&template=flaking-test.md). In
your issue, answer the following questions:
- Is this flaking in multiple jobs? You can search for the flaking test or error
messages using the
[Kubernetes Aggregated Test Results](http://go.k8s.io/triage) tool.
- Are there multiple tests in the same package or suite failing with the same apparent error?
In addition, be sure to include the following information:
- A link to [testgrid](https://testgrid.k8s.io/) history for the flaking test's
jobs, filtered to the relevant tests
- The failed test output &mdash; this is essential because it makes the issue searchable
- A link to the triage query
- A link to specific failures
- Be sure to tag the relevant SIG, if you know what it is.
For a good example of a flaking test issue,
[check here](https://github.com/kubernetes/kubernetes/issues/93358).
([TODO](https://github.com/kubernetes/kubernetes/issues/95528): Move these instructions to the issue template.)
## Deflaking unit tests
To get started with deflaking unit tests, you will need to first
reproduce the flaky behavior. Start with a simple attempt to just run
the flaky unit test. For example:
```sh
go test ./pkg/kubelet/config -run TestInvalidPodFiltered
```
Also make sure that you bypass the `go test` cache by using an uncachable
command line option:
```sh
echo "" > output.txt
for i in {1..4}; do
echo "Checking kubernetes-node-${i}"
echo "kubernetes-node-${i}:" >> output.txt
gcloud compute ssh "kubernetes-node-${i}" --command="sudo docker ps -a" >> output.txt
done
grep "Exited ([^0])" output.txt
go test ./pkg/kubelet/config -count=1 -run TestInvalidPodFiltered
```
If even this is not revealing issues with the flaky test, try running with
[race detection](https://golang.org/doc/articles/race_detector.html) enabled:
```sh
go test ./pkg/kubelet/config -race -count=1 -run TestInvalidPodFiltered
```
Finally, you can stress test the unit test using the
[stress command](https://godoc.org/golang.org/x/tools/cmd/stress). Install it
with this command:
```sh
go get golang.org/x/tools/cmd/stress
```
Then build your test binary:
```sh
go test ./pkg/kubelet/config -race -c
```
Then run it under stress:
```sh
stress ./config.test -test.run TestInvalidPodFiltered
```
The stress command runs the test binary repeatedly, reporting when it fails. It
will periodically report how many times it has run and how many failures have
occurred.
You should see output like this:
```
411 runs so far, 0 failures
/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20200825T115041-341977266
--- FAIL: TestInvalidPodFiltered (0.00s)
config_test.go:126: Expected no update in channel, Got types.PodUpdate{Pods:[]*v1.Pod{(*v1.Pod)(0xc00059e400)}, Op:1, Source:"test"}
FAIL
ERROR: exit status 1
815 runs so far, 1 failures
```
Be careful with tests that use the `net/http/httptest` package; they could
exhaust the available ports on your system!
## Deflaking integration tests
Integration tests run similarly to unit tests, but they almost always expect a
running `etcd` instance. You should already have `etcd` installed if you have
followed the instructions in the [Development Guide](../development.md). Run
`etcd` in another shell window or tab.
Compile your integration test using a command like this:
```sh
go test -c -race ./test/integration/endpointslice
```
And then stress test the flaky test using the `stress` command:
```sh
stress ./endpointslice.test -test.run TestEndpointSliceMirroring
```
For an example of a failing or flaky integration test,
[read this issue](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-678375312).
Sometimes, but not often, a test will fail due to timeouts caused by
deadlocks. This can show up when stress testing an entire package. To track it
down, stress test the individual tests in that package. This process can take
extra effort. Try following these steps:
1. Run each test in the package individually to figure out the average runtime.
2. Stress each test individually, bounding the timeout to 100 times the average run time.
3. Isolate the particular test that is deadlocking.
4. Add debug output to figure out what is causing the deadlock.
Hopefully this can help narrow down exactly where the deadlock is occurring,
revealing a simple fix!
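One low-tech way to bound a suspect step (as in step 2 above) is to give it its own timeout, so the failure points at the step that hung. A hypothetical helper, not part of the Kubernetes test libraries, might look like this:

```go
package example_test

import (
	"testing"
	"time"
)

// waitOrFail fails the test with a pointed message if a step does not finish
// in time, instead of letting the whole package hang until go test's timeout.
func waitOrFail(t *testing.T, done <-chan struct{}, timeout time.Duration) {
	t.Helper()
	select {
	case <-done:
	case <-time.After(timeout):
		t.Fatalf("step did not complete within %v; possible deadlock", timeout)
	}
}
```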
## Deflaking e2e tests
A flaky [end-to-end (e2e) test](e2e-tests.md) offers its own set of
challenges. In particular, these tests are difficult because they test the
entire Kubernetes system. This can be both good and bad. It can be good because
we want the entire system to work when testing, but an e2e test can also fail
because of something completely unrelated, such as failing infrastructure or
misconfigured volumes. Be aware that you can't simply look at the title of an
e2e test to understand exactly what is being tested. If possible, look for unit
and integration tests related to the problem you are trying to solve.
### Gathering information
The first step in deflaking an e2e test is to gather information. We capture a
lot of information from e2e test runs, and you can use these artifacts to gather
information as to why a test is failing.
Use the [Prow Status](https://prow.k8s.io/) tool to collect information on
specific test jobs. Drill down into a job and use the **Artifacts** tab to
collect information. For example, with
[this particular test job](https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1296558932902285312),
we can collect the following:
* `build-log.txt`
* In the control plane directory: `artifacts/e2e-171671cb3f-674b9-master/`
* `kube-apiserver-audit.log` (and rotated files)
* `kube-apiserver.log`
* `kube-controller-manager.log`
* `kube-scheduler.log`
* And more!
The `artifacts/` directory will contain much more information. From inside the
directories for each node:
- `e2e-171671cb3f-674b9-minion-group-drkr`
- `e2e-171671cb3f-674b9-minion-group-lr2z`
- `e2e-171671cb3f-674b9-minion-group-qkkz`
Look for these files:
* `kubelet.log`
* `docker.log`
* `kube-proxy.log`
* And so forth.
### Filtering and correlating information
Once you have gathered your information, the next step is to filter and
correlate the information. This can require some familiarity with the issue you are tracking
down, but look first at the relevant components, such as the test log, logs for the API
server, controller manager, and `kubelet`.
Filter the logs to find events that happened around the time of the failure and
events that occurred in related namespaces and objects.
The goal is to collate log entries from all of these different files so you can
get a picture of what was happening in the distributed system. This will help
you figure out exactly where the e2e test is failing. One tool that may help you
with this is [k8s-e2e-log-combiner](https://github.com/brianpursley/k8s-e2e-log-combiner).
Kubernetes has a lot of nested systems, so sometimes log entries can refer to
events happening three levels deep. This means that line numbers in logs might
not refer to where problems and messages originate. Do not make any assumptions
about where messages are initiated!
If you have trouble finding relevant logging information or events, don't be
afraid to add debugging output to the test. For an example of this approach,
[see this issue](https://github.com/kubernetes/kubernetes/pull/88297#issuecomment-588607417).
### What to look for
One of the first things to look for is if the test is assuming that something is
running synchronously when it actually runs asynchronously. For example, if the
test is kicking off a goroutine, you might need to add delays to simulate slow
operations and reproduce issues.
Examples of the types of changes you could make to try to force a failure:
- `time.Sleep(time.Second)` at the top of a goroutine
- `time.Sleep(time.Second)` at the beginning of a watch event handler
- `time.Sleep(time.Second)` at the end of a watch event handler
- `time.Sleep(time.Second)` at the beginning of a sync loop worker
- `time.Sleep(time.Second)` at the end of a sync loop worker
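For example, here is a sketch of the first bullet above; `processNextWorkItem` is a made-up stand-in for whatever the real goroutine does:

```go
package example

import "time"

// startWorker injects an artificial delay at the top of the goroutine so a
// latent race loses reliably instead of once in a thousand runs.
// The sleep is a debugging aid only; remove it before committing.
func startWorker(processNextWorkItem func()) {
	go func() {
		time.Sleep(time.Second)
		processNextWorkItem()
	}()
}
```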
Sometimes,
[such as in this example](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-675631856),
a test might be causing a race condition with the system it is trying to
test. Investigate if the test is conflicting with an asynchronous background
process. To verify the issue, simulate the test losing the race by putting a
`time.Sleep(time.Second)` between test steps.
If a test is assuming that an operation will happen quickly, it might not be
taking into account the configuration of a CI environment. A CI environment will
generally be more resource-constrained and will run multiple tests in
parallel. If it runs in less than a second locally, it could take a few seconds
in a CI environment.
Unless your test is specifically testing performance/timing, don't set tight
timing tolerances. Use `wait.ForeverTestTimeout`, which is a reasonable stand-in
for operations that should not take very long. This is a better approach than
polling for 1 to 10 seconds.
Is the test incorrectly assuming deterministic output? Remember that map iteration in Go is
non-deterministic. If there is a list being compiled or a set of steps are being
performed by iterating over a map, they will not be completed in a predictable
order. Make sure the test is able to tolerate any order in a map.
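As a sketch of one way to do that (the helper name is made up), sort map-derived output before asserting on it, or use an order-insensitive comparison:

```go
package example

import "sort"

// sortedNames produces deterministic output from a map, since Go randomizes
// map iteration order from run to run.
func sortedNames(byName map[string]int) []string {
	names := make([]string, 0, len(byName))
	for name := range byName {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}
```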
Be aware that if a test mixes random allocation with static allocation, there
will be intermittent conflicts.
Finally, if you are using a fake client with a watcher, it can relist/rewatch at any point.
It is better to look for specific actions in the fake client rather than
asserting exact content of the full set.
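A minimal sketch of that idea, assuming a test built on `k8s.io/client-go`'s fake clientset (the pod and the code under test are placeholders):

```go
package example_test

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestPodGetsCreated(t *testing.T) {
	client := fake.NewSimpleClientset()

	// Stand-in for the code under test, which would receive `client` and
	// create a pod at some point.
	_, _ = client.CoreV1().Pods("default").Create(context.TODO(),
		&corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "test-pod"}},
		metav1.CreateOptions{})

	// Look for the specific action we care about instead of asserting on the
	// exact, ordered contents of client.Actions(), which may include extra
	// list/watch actions if the watcher relists.
	for _, action := range client.Actions() {
		if action.Matches("create", "pods") {
			return
		}
	}
	t.Fatalf("expected a create action for pods, got %v", client.Actions())
}
```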