Add Deflaking tests information to Developer Guide

This updates flaky-tests.md with all of the information on finding and
deflaking tests from the presentation to SIG Testing found here:
https://www.youtube.com/watch?v=Ewp8LNY_qTg

Also, this drops the outdated "Hunting flaky unit tests" section from
flaky-tests.md.

Co-authored-by: Aaron Crickenberger <spiffxp@google.com>

This commit is contained in:
parent c2fa487203
commit 15de14640b

flaky-tests.md

@@ -1,4 +1,4 @@
# Flaky Tests

Any test that fails occasionally is "flaky". Since our merges only proceed when
all tests are green, and we have a number of different CI systems running the

@@ -10,7 +10,27 @@ writing our tests defensively. When flakes are identified, we should prioritize
addressing them, either by fixing them or quarantining them off the critical
path.

For more information about deflaking Kubernetes tests, watch @liggitt's
[presentation from Kubernetes SIG Testing - 2020-08-25](https://www.youtube.com/watch?v=Ewp8LNY_qTg).

**Table of Contents**

- [Flaky Tests](#flaky-tests)
  - [Avoiding Flakes](#avoiding-flakes)
  - [Quarantining Flakes](#quarantining-flakes)
  - [Hunting Flakes](#hunting-flakes)
  - [GitHub Issues for Known Flakes](#github-issues-for-known-flakes)
    - [Expectations when a flaky test is assigned to you](#expectations-when-a-flaky-test-is-assigned-to-you)
    - [Writing a good flake report](#writing-a-good-flake-report)
  - [Deflaking unit tests](#deflaking-unit-tests)
  - [Deflaking integration tests](#deflaking-integration-tests)
  - [Deflaking e2e tests](#deflaking-e2e-tests)
    - [Gathering information](#gathering-information)
    - [Filtering and correlating information](#filtering-and-correlating-information)
    - [What to look for](#what-to-look-for)

## Avoiding Flakes

Write tests defensively. Remember that "almost never" happens all the time when
tests are run thousands of times in a CI environment. Tests need to be tolerant

@@ -45,7 +65,7 @@ Don't assume things will succeed after a fixed delay, but don't wait forever.
    - "expected 3 widgets, found 2, will retry"
    - "expected pod to be in state foo, currently in state bar, will retry"
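
For example, here is a minimal sketch of this kind of tolerant, informative
polling in a Go test, using `wait.PollImmediate` from
`k8s.io/apimachinery/pkg/util/wait`; the package and the `getWidgets` helper
are illustrative stand-ins for whatever the real test queries:

```go
package widgets

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// getWidgets stands in for a real lookup (e.g. a client list call); here it
// simply reports a third widget once a couple of seconds have passed.
var readyAt = time.Now().Add(2 * time.Second)

func getWidgets() ([]string, error) {
	if time.Now().Before(readyAt) {
		return []string{"a", "b"}, nil
	}
	return []string{"a", "b", "c"}, nil
}

func TestWidgetsEventuallyReady(t *testing.T) {
	// Poll instead of sleeping for a fixed delay, and log progress on each
	// retry so a flake leaves a useful trail in the test output.
	err := wait.PollImmediate(500*time.Millisecond, wait.ForeverTestTimeout, func() (bool, error) {
		widgets, err := getWidgets()
		if err != nil {
			return false, err // hard error: stop polling
		}
		if len(widgets) != 3 {
			t.Logf("expected 3 widgets, found %d, will retry", len(widgets))
			return false, nil // not ready yet: keep polling
		}
		return true, nil
	})
	if err != nil {
		t.Fatalf("timed out waiting for widgets: %v", err)
	}
}
```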

## Quarantining Flakes

- When quarantining a presubmit test, ensure an issue exists in the current
  release milestone assigned to the owning SIG. The issue should be labeled

@@ -63,7 +83,7 @@ Don't assume things will succeed after a fixed delay, but don't wait forever.
  feature. The majority of release-blocking and merge-blocking suites avoid
  these jobs unless they're proven to be non-flaky.

## Hunting Flakes

We offer the following tools to aid in finding or troubleshooting flakes

@@ -82,7 +102,7 @@ We offer the following tools to aid in finding or troubleshooting flakes
[go.k8s.io/triage]: https://go.k8s.io/triage
[testgrid.k8s.io]: https://testgrid.k8s.io

## GitHub Issues for Known Flakes

Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.

@@ -105,7 +125,7 @@ flakes is a quick way to gain expertise and community goodwill.
[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake

### Expectations when a flaky test is assigned to you

Note that we won't randomly assign these issues to you unless you've opted in or
you're part of a group that has opted in. We are more than happy to accept help

@@ -140,113 +160,237 @@ release-blocking flakes. Therefore we have the following guidelines:
   name, which will result in the test being quarantined to only those jobs that
   explicitly run flakes (eg: https://testgrid.k8s.io/google-gce#gci-gce-flaky)

### Writing a good flake report

If you are reporting a flake, it is important to include enough information for
others to reproduce the issue. When filing the issue, use the
[flaking test template](https://github.com/kubernetes/kubernetes/issues/new?labels=kind%2Fflake&template=flaking-test.md).
In your issue, answer the following questions:

- Is this flaking in multiple jobs? You can search for the flaking test or error
  messages using the
  [Kubernetes Aggregated Test Results](http://go.k8s.io/triage) tool.
- Are there multiple tests in the same package or suite failing with the same
  apparent error?

In addition, be sure to include the following information:

- A link to [testgrid](https://testgrid.k8s.io/) history for the flaking test's
  jobs, filtered to the relevant tests
- The failed test output; this is essential because it makes the issue searchable
- A link to the triage query
- A link to specific failures
- Be sure to tag the relevant SIG, if you know what it is.

For a good example of a flaking test issue,
[check here](https://github.com/kubernetes/kubernetes/issues/93358).

([TODO](https://github.com/kubernetes/kubernetes/issues/95528): Move these
instructions to the issue template.)

## Deflaking unit tests

To get started with deflaking unit tests, you will need to first reproduce the
flaky behavior. Start with a simple attempt to just run the flaky unit test.
For example:

```sh
go test ./pkg/kubelet/config -run TestInvalidPodFiltered
```

Also make sure that you bypass the `go test` cache by using an uncacheable
command-line option:

```sh
go test ./pkg/kubelet/config -count=1 -run TestInvalidPodFiltered
```

If even this does not reveal issues with the flaky test, try running with
[race detection](https://golang.org/doc/articles/race_detector.html) enabled:

```sh
go test ./pkg/kubelet/config -race -count=1 -run TestInvalidPodFiltered
```

Finally, you can stress test the unit test using the
[stress command](https://godoc.org/golang.org/x/tools/cmd/stress). Install it
with this command:

```sh
go get golang.org/x/tools/cmd/stress
```

Then build your test binary:

```sh
go test ./pkg/kubelet/config -race -c
```

Then run it under stress:

```sh
stress ./config.test -test.run TestInvalidPodFiltered
```

The stress command runs the test binary repeatedly, reporting when it fails. It
will periodically report how many times it has run and how many failures have
occurred. You should see output like this:

```
411 runs so far, 0 failures
/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20200825T115041-341977266
--- FAIL: TestInvalidPodFiltered (0.00s)
    config_test.go:126: Expected no update in channel, Got types.PodUpdate{Pods:[]*v1.Pod{(*v1.Pod)(0xc00059e400)}, Op:1, Source:"test"}
FAIL
ERROR: exit status 1
815 runs so far, 1 failures
```

Be careful with tests that use the `net/http/httptest` package; they could
exhaust the available ports on your system!

## Deflaking integration tests

Integration tests run similarly to unit tests, but they almost always expect a
running `etcd` instance. You should already have `etcd` installed if you have
followed the instructions in the [Development Guide](../development.md). Run
`etcd` in another shell window or tab.

Compile your integration test using a command like this:

```sh
go test -c -race ./test/integration/endpointslice
```

And then stress test the flaky test using the `stress` command:

```sh
stress ./endpointslice.test -test.run TestEndpointSliceMirroring
```

For an example of a failing or flaky integration test,
[read this issue](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-678375312).

Sometimes, but not often, a test will fail due to timeouts caused by deadlocks.
A deadlock may surface as a timeout when you stress test an entire package; to
track down the offending test, stress test the tests in that package
individually. This process can take extra effort. Try following these steps:

1. Run each test in the package individually to figure out the average runtime.
2. Stress each test individually, bounding the timeout to 100 times the average
   run time.
3. Isolate the particular test that is deadlocking.
4. Add debug output to figure out what is causing the deadlock (see the sketch
   below).
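
For step 4, one way to get debug output is to have the test dump every
goroutine's stack when it times out, making the blocked goroutines visible.
Here is a minimal sketch using the standard `runtime/pprof` package; the
package name, `runSuspectCode`, and the 30-second bound (roughly 100 times an
imagined average run time) are illustrative:

```go
package deadlock

import (
	"os"
	"runtime/pprof"
	"testing"
	"time"
)

// runSuspectCode stands in for the code path that appears to deadlock;
// this version deliberately blocks forever.
func runSuspectCode() {
	block := make(chan struct{})
	<-block
}

func TestSuspectedDeadlock(t *testing.T) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		runSuspectCode()
	}()

	select {
	case <-done:
		// Finished normally.
	case <-time.After(30 * time.Second):
		// Dump all goroutine stacks so the blocked ones can be inspected.
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		t.Fatal("timed out; possible deadlock, see goroutine dump above")
	}
}
```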

Hopefully this can help narrow down exactly where the deadlock is occurring,
revealing a simple fix!

## Deflaking e2e tests

A flaky [end-to-end (e2e) test](e2e-tests.md) offers its own set of
challenges. In particular, these tests are difficult because they test the
entire Kubernetes system. This can be both good and bad. It can be good because
we want the entire system to work when testing, but an e2e test can also fail
because of something completely unrelated, such as failing infrastructure or
misconfigured volumes. Be aware that you can't simply look at the title of an
e2e test to understand exactly what is being tested. If possible, look for unit
and integration tests related to the problem you are trying to solve.

### Gathering information

The first step in deflaking an e2e test is to gather information. We capture a
lot of information from e2e test runs, and you can use these artifacts to
gather information as to why a test is failing.

Use the [Prow Status](https://prow.k8s.io/) tool to collect information on
specific test jobs. Drill down into a job and use the **Artifacts** tab to
collect information. For example, with
[this particular test job](https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1296558932902285312),
we can collect the following:

- `build-log.txt`
- In the control plane directory, `artifacts/e2e-171671cb3f-674b9-master/`:
  - `kube-apiserver-audit.log` (and rotated files)
  - `kube-apiserver.log`
  - `kube-controller-manager.log`
  - `kube-scheduler.log`
  - And more!

The `artifacts/` directory will contain much more information. From inside the
directories for each node:

- `e2e-171671cb3f-674b9-minion-group-drkr`
- `e2e-171671cb3f-674b9-minion-group-lr2z`
- `e2e-171671cb3f-674b9-minion-group-qkkz`

Look for these files:

- `kubelet.log`
- `docker.log`
- `kube-proxy.log`
- And so forth.

### Filtering and correlating information

Once you have gathered your information, the next step is to filter and
correlate it. This can require some familiarity with the issue you are tracking
down, but look first at the relevant components, such as the test log and the
logs for the API server, controller manager, and `kubelet`.

Filter the logs to find events that happened around the time of the failure and
events that occurred in related namespaces and objects.

The goal is to collate log entries from all of these different files so you can
get a picture of what was happening in the distributed system. This will help
you figure out exactly where the e2e test is failing. One tool that may help
you with this is
[k8s-e2e-log-combiner](https://github.com/brianpursley/k8s-e2e-log-combiner).

Kubernetes has a lot of nested systems, so sometimes log entries can refer to
events happening three levels deep. This means that line numbers in logs might
not refer to where problems and messages originate. Do not make any assumptions
about where messages are initiated!

If you have trouble finding relevant logging information or events, don't be
afraid to add debugging output to the test. For an example of this approach,
[see this issue](https://github.com/kubernetes/kubernetes/pull/88297#issuecomment-588607417).

### What to look for

One of the first things to look for is whether the test assumes that something
runs synchronously when it actually runs asynchronously. For example, if the
test is kicking off a goroutine, you might need to add delays to simulate slow
operations and reproduce issues.

Examples of the types of changes you could make to try to force a failure (see
the sketch after this list):

- `time.Sleep(time.Second)` at the top of a goroutine
- `time.Sleep(time.Second)` at the beginning of a watch event handler
- `time.Sleep(time.Second)` at the end of a watch event handler
- `time.Sleep(time.Second)` at the beginning of a sync loop worker
- `time.Sleep(time.Second)` at the end of a sync loop worker
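
As a concrete illustration, here is a self-contained sketch of the first item:
a delay injected at the top of a worker goroutine (a temporary change while
deflaking, not something to commit) that exposes a hidden ordering assumption.
All names are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	updates := make(chan string, 1)

	// The worker goroutine under test.
	go func() {
		time.Sleep(time.Second) // injected delay: simulate a slow worker
		updates <- "synced"
	}()

	// A step that assumed the worker had already run now fails reliably,
	// reproducing the flake on every run.
	select {
	case u := <-updates:
		fmt.Println("got update:", u)
	case <-time.After(100 * time.Millisecond):
		fmt.Println("flake reproduced: no update received yet")
	}
}
```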

Sometimes,
[such as in this example](https://github.com/kubernetes/kubernetes/issues/93496#issuecomment-675631856),
a test might be causing a race condition with the system it is trying to test.
Investigate whether the test is conflicting with an asynchronous background
process. To verify the issue, simulate the test losing the race by putting a
`time.Sleep(time.Second)` between test steps.

If a test assumes that an operation will happen quickly, it might not be taking
into account the configuration of a CI environment. A CI environment will
generally be more resource-constrained and will run multiple tests in parallel.
A test that runs in less than a second locally could take a few seconds in a CI
environment.

Unless your test is specifically testing performance or timing, don't set tight
timing tolerances. Use `wait.ForeverTestTimeout`, which is a reasonable
stand-in for operations that should not take very long. This is a better
approach than polling for 1 to 10 seconds.
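
A sketch of that pattern, waiting on a channel with `wait.ForeverTestTimeout`
(30 seconds, from `k8s.io/apimachinery/pkg/util/wait`) instead of a hand-tuned
one-second timeout; the worker goroutine is an illustrative stand-in for the
system under test:

```go
package worker

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func TestWorkerPublishesUpdate(t *testing.T) {
	updates := make(chan string, 1)
	go func() { updates <- "synced" }() // stand-in for the system under test

	select {
	case u := <-updates:
		if u != "synced" {
			t.Fatalf("unexpected update: %q", u)
		}
	case <-time.After(wait.ForeverTestTimeout):
		// Generous enough for a loaded CI machine, but the test still
		// terminates if the update never arrives.
		t.Fatal("timed out waiting for update")
	}
}
```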

Is the test incorrectly assuming deterministic output? Remember that map
iteration in Go is non-deterministic. If a list is compiled or a set of steps
is performed by iterating over a map, the results will not arrive in a
predictable order. Make sure the test can tolerate any iteration order.
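
For example, a minimal sketch of an order-independent assertion: sort whatever
you collect from the map before comparing it:

```go
package widgets

import (
	"reflect"
	"sort"
	"testing"
)

func TestWidgetNames(t *testing.T) {
	widgets := map[string]int{"foo": 1, "bar": 2, "baz": 3}

	var names []string
	for name := range widgets { // iteration order is randomized in Go
		names = append(names, name)
	}

	sort.Strings(names) // compare in a canonical order, not iteration order
	if want := []string{"bar", "baz", "foo"}; !reflect.DeepEqual(names, want) {
		t.Fatalf("got %v, want %v", names, want)
	}
}
```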

Be aware that if a test mixes random allocation with static allocation, there
will be intermittent conflicts.

Finally, if you are using a fake client with a watcher, it can relist/rewatch
at any point. It is better to look for specific actions in the fake client than
to assert the exact contents of the full action set.
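
For instance, with the fake clientset from `k8s.io/client-go/kubernetes/fake`,
here is a sketch of scanning the recorded actions for the one you care about,
rather than asserting on the exact action list (the deployment name is
illustrative):

```go
package controller

import (
	"context"
	"testing"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestCreatesDeployment(t *testing.T) {
	client := fake.NewSimpleClientset()

	_, err := client.AppsV1().Deployments("default").Create(
		context.TODO(),
		&appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{Name: "demo"}},
		metav1.CreateOptions{},
	)
	if err != nil {
		t.Fatal(err)
	}

	// Scan for the specific action instead of asserting the exact,
	// complete list, which may gain extra list/watch actions.
	found := false
	for _, action := range client.Actions() {
		if action.GetVerb() == "create" && action.GetResource().Resource == "deployments" {
			found = true
		}
	}
	if !found {
		t.Fatal("expected a create action for deployments")
	}
}
```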