From be3986f7d043048bb639c3152da2373a4001c6b2 Mon Sep 17 00:00:00 2001
From: Aaron Crickenberger
Date: Thu, 12 Dec 2019 16:23:18 -0800
Subject: [PATCH] Add some suggestions on how to handle flakes

based on looking at deflaking PRs from the past few months, what are
some common anti-patterns or fixes?
---
 contributors/devel/sig-testing/flaky-tests.md | 66 ++++++++++++++++---
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/contributors/devel/sig-testing/flaky-tests.md b/contributors/devel/sig-testing/flaky-tests.md
index f23272b1e..0ca4e9a08 100644
--- a/contributors/devel/sig-testing/flaky-tests.md
+++ b/contributors/devel/sig-testing/flaky-tests.md
@@ -5,20 +5,70 @@ all tests are green, and we have a number of different CI systems running the
 tests in various combinations, even a small percentage of flakes results in a
 lot of pain for people waiting for their PRs to merge.
 
-Therefore, it's very important that we write tests defensively. Situations that
-"almost never happen" happen with some regularity when run thousands of times in
-resource-constrained environments. Since flakes can often be quite hard to
-reproduce while still being common enough to block merges occasionally, it's
-additionally important that the test logs be useful for narrowing down exactly
-what caused the failure.
+Therefore, it's important that we take flakes seriously. We should avoid flakes
+by writing our tests defensively. When flakes are identified, we should
+prioritize addressing them, either by fixing them or quarantining them off the
+critical path.
 
-Note that flakes can occur in unit tests, integration tests, or end-to-end
-tests, but probably occur most commonly in end-to-end tests.
+# Avoiding Flakes
+
+Write tests defensively. Remember that "almost never" happens all the time when
+tests are run thousands of times in a CI environment. Tests need to be tolerant
+of other tests running concurrently, resource contention, and things taking
+longer than expected.
+
+There is a balance to strike here. Don't log too much, but don't log too little.
+Don't assume things will succeed after a fixed delay, but don't wait forever.
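+
+For example, rather than sleeping for a fixed delay and asserting once, poll
+until the condition holds or a generous timeout expires, and say why each retry
+failed. A minimal sketch (the widget lister is hypothetical; only
+`wait.PollImmediate` and `wait.ForeverTestTimeout` come from
+`k8s.io/apimachinery/pkg/util/wait`, and a real e2e test would log through the
+test framework rather than `fmt`):
+
+```go
+package example
+
+import (
+	"fmt"
+	"time"
+
+	"k8s.io/apimachinery/pkg/util/wait"
+)
+
+// waitForWidgets polls instead of assuming widgets appear after a fixed delay,
+// and explains why it is still waiting so a flake is easy to diagnose later.
+func waitForWidgets(listWidgets func() (int, error), want int) error {
+	return wait.PollImmediate(time.Second, wait.ForeverTestTimeout, func() (bool, error) {
+		got, err := listWidgets()
+		if err != nil {
+			// Tolerate transient errors; the poll retries until the timeout.
+			fmt.Printf("listing widgets failed, will retry: %v\n", err)
+			return false, nil
+		}
+		if got != want {
+			fmt.Printf("expected %d widgets, found %d, will retry\n", want, got)
+			return false, nil
+		}
+		return true, nil
+	})
+}
+```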
+
+- Ensure the test functions in parallel with other tests
+  - Be specific enough to ensure a test isn't thrown off by other tests' assets
+    - https://github.com/kubernetes/kubernetes/pull/85849 - eg: ensure resource name and namespace match
+    - https://github.com/kubernetes/kubernetes/pull/85967 - eg: tolerate errors for non k8s.io APIs
+    - https://github.com/kubernetes/kubernetes/pull/85619 - eg: tolerate multiple storage plugins
+- Ensure the test functions in a resource-constrained environment
+  - Only ask for the resources you need
+    - https://github.com/kubernetes/kubernetes/pull/84975 - eg: drop memory constraints for test cases that only need cpu
+  - Don't use overly tight deadlines (but not overly broad either; non-[Slow] tests time out after 5min)
+    - https://github.com/kubernetes/kubernetes/pull/85847 - eg: poll for `wait.ForeverTestTimeout` instead of 10s
+    - https://github.com/kubernetes/kubernetes/pull/84238 - eg: poll for 2min instead of 1min
+    - mark tests as [Slow] if they are unable to pass within 5min
+  - Do not expect actions to happen instantaneously or after a fixed delay
+    - Prefer informers and wait loops (see the sketch above)
+- Ensure the test provides sufficient context in logs for forensic debugging
+  - Explain what the test is doing, eg:
+    - "creating a foo with invalid configuration"
+    - "patching the foo to have a bar"
+  - Explain what specific check failed, and how, eg:
+    - "failed to create resource foo in namespace bar because of err"
+    - "expected all items to be deleted, but items foo, bar, and baz remain"
+  - Explain why a polling loop is failing, eg:
+    - "expected 3 widgets, found 2, will retry"
+    - "expected pod to be in state foo, currently in state bar, will retry"
+
+# Quarantining Flakes
+
+- When quarantining a presubmit test, ensure an issue exists in the current
+  release milestone assigned to the owning SIG. The issue should be labeled
+  `priority/critical-urgent`, `lifecycle/frozen`, and `kind/flake`. The
+  expectation is for the owning SIG to resolve the flakes and reintroduce the
+  test, or determine the tested functionality is covered via another method
+  and delete the test in question.
+- Quarantine a single test case by adding `[Flaky]` to the test name in
+  question; most CI jobs exclude these tests. This makes the most sense for
+  flakes that are merge-blocking and taking too long to troubleshoot, or
+  occurring across multiple jobs (see the sketch at the end of this document).
+  - eg: https://github.com/kubernetes/kubernetes/pull/83792
+  - eg: https://github.com/kubernetes/kubernetes/pull/86327
+- Quarantine an entire set of tests by adding `[Feature:Foo]` to the test(s) in
+  question. This will require creating jobs that focus specifically on this
+  feature. The majority of release-blocking and merge-blocking suites avoid
+  these jobs unless they're proven to be non-flaky.
 
 # Hunting Flakes
 
 We offer the following tools to aid in finding or troubleshooting flakes
+- [flakes-latest.json](http://storage.googleapis.com/k8s-metrics/flakes-latest.json)
+  - shows the top 10 flakes over the past week for all PR jobs
 - [go.k8s.io/triage] - an interactive test failure report providing filtering and
   drill-down by job name, test name, failure text for failures in the last two weeks
   - https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24 - all failures that happened in the `pull-kubernetes-e2e-gce` job
   - https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=timed%20out - all failures containing the text `timed out`
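+
+To expand on the Quarantining Flakes section above: quarantine tags are plain
+substrings of the Ginkgo test description, so adding `[Flaky]` or
+`[Feature:Foo]` is a one-line change, and CI jobs include or exclude the tests
+with `--ginkgo.skip` / `--ginkgo.focus` regexes. A minimal sketch (the
+`[sig-example]` widget tests are hypothetical):
+
+```go
+package example
+
+import "github.com/onsi/ginkgo"
+
+var _ = ginkgo.Describe("[sig-example] Widgets", func() {
+	// Adding [Flaky] quarantines just this case from jobs that skip [Flaky].
+	ginkgo.It("should reconcile widgets [Flaky]", func() {
+		// ... test body elided ...
+	})
+
+	// Adding [Feature:Widgets] keeps this case out of the default suites;
+	// it only runs in jobs that explicitly focus on [Feature:Widgets].
+	ginkgo.It("should garbage collect widgets [Feature:Widgets]", func() {
+		// ... test body elided ...
+	})
+})
+```
+
+Remember to file the tracking issue described above so the owning SIG can fix
+and reintroduce the quarantined test.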