From be3986f7d043048bb639c3152da2373a4001c6b2 Mon Sep 17 00:00:00 2001
From: Aaron Crickenberger
Date: Thu, 12 Dec 2019 16:23:18 -0800
Subject: [PATCH] Add some suggestions on how to handle flakes

based on looking at deflaking PRs from the past few months, what are
some common anti-patterns or fixes?
---
 contributors/devel/sig-testing/flaky-tests.md | 66 ++++++++++++++++---
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/contributors/devel/sig-testing/flaky-tests.md b/contributors/devel/sig-testing/flaky-tests.md
index f23272b1e..0ca4e9a08 100644
--- a/contributors/devel/sig-testing/flaky-tests.md
+++ b/contributors/devel/sig-testing/flaky-tests.md
@@ -5,20 +5,70 @@ all tests are green, and we have a number of different CI systems running the
 tests in various combinations, even a small percentage of flakes results in a
 lot of pain for people waiting for their PRs to merge.
 
-Therefore, it's very important that we write tests defensively. Situations that
-"almost never happen" happen with some regularity when run thousands of times in
-resource-constrained environments. Since flakes can often be quite hard to
-reproduce while still being common enough to block merges occasionally, it's
-additionally important that the test logs be useful for narrowing down exactly
-what caused the failure.
+Therefore, it's important that we take flakes seriously. We should avoid flakes
+by writing our tests defensively. When flakes are identified, we should
+prioritize addressing them, either by fixing them or quarantining them off the
+critical path.
 
-Note that flakes can occur in unit tests, integration tests, or end-to-end
-tests, but probably occur most commonly in end-to-end tests.
+# Avoiding Flakes
+
+Write tests defensively. Remember that "almost never" happens all the time when
+tests are run thousands of times in a CI environment. Tests need to be tolerant
+of other tests running concurrently, resource contention, and things taking
+longer than expected.
+
+There is a balance to strike here. Don't log too much, but don't log too little.
+Don't assume things will succeed after a fixed delay, but don't wait forever.
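+
+For example, rather than sleeping for a fixed delay and asserting once, poll
+until the condition holds or a generous timeout expires, and say why each retry
+failed. A minimal sketch (the widget lister is hypothetical; only
+`wait.PollImmediate` and `wait.ForeverTestTimeout` come from
+`k8s.io/apimachinery/pkg/util/wait`, and a real e2e test would log through the
+test framework rather than `fmt`):
+
+```go
+package example
+
+import (
+	"fmt"
+	"time"
+
+	"k8s.io/apimachinery/pkg/util/wait"
+)
+
+// waitForWidgets polls instead of assuming widgets appear after a fixed delay,
+// and explains why it is still waiting so a flake is easy to diagnose later.
+func waitForWidgets(listWidgets func() (int, error), want int) error {
+	return wait.PollImmediate(time.Second, wait.ForeverTestTimeout, func() (bool, error) {
+		got, err := listWidgets()
+		if err != nil {
+			// Tolerate transient errors; the poll retries until the timeout.
+			fmt.Printf("listing widgets failed, will retry: %v\n", err)
+			return false, nil
+		}
+		if got != want {
+			fmt.Printf("expected %d widgets, found %d, will retry\n", want, got)
+			return false, nil
+		}
+		return true, nil
+	})
+}
+```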
+
+- Ensure the test functions in parallel with other tests
+  - Be specific enough to ensure a test isn't thrown off by other tests' assets
+    - https://github.com/kubernetes/kubernetes/pull/85849 - eg: ensure resource name and namespace match
+    - https://github.com/kubernetes/kubernetes/pull/85967 - eg: tolerate errors for non k8s.io APIs
+    - https://github.com/kubernetes/kubernetes/pull/85619 - eg: tolerate multiple storage plugins
+- Ensure the test functions in a resource-constrained environment
+  - Only ask for the resources you need
+    - https://github.com/kubernetes/kubernetes/pull/84975 - eg: drop memory constraints for test cases that only need cpu
+  - Don't use overly tight deadlines (but not overly broad either; non-[Slow] tests time out after 5min)
+    - https://github.com/kubernetes/kubernetes/pull/85847 - eg: poll for `wait.ForeverTestTimeout` instead of 10s
+    - https://github.com/kubernetes/kubernetes/pull/84238 - eg: poll for 2min instead of 1min
+    - mark tests as [Slow] if they are unable to pass within 5min
+  - Do not expect actions to happen instantaneously or after a fixed delay
+    - Prefer informers and wait loops (see the sketch above)
+- Ensure the test provides sufficient context in logs for forensic debugging
+  - Explain what the test is doing, eg:
+    - "creating a foo with invalid configuration"
+    - "patching the foo to have a bar"
+  - Explain what specific check failed, and how, eg:
+    - "failed to create resource foo in namespace bar because of err"
+    - "expected all items to be deleted, but items foo, bar, and baz remain"
+  - Explain why a polling loop is failing, eg:
+    - "expected 3 widgets, found 2, will retry"
+    - "expected pod to be in state foo, currently in state bar, will retry"
+
+# Quarantining Flakes
+
+- When quarantining a presubmit test, ensure an issue exists in the current
+  release milestone assigned to the owning SIG. The issue should be labeled
+  `priority/critical-urgent`, `lifecycle/frozen`, and `kind/flake`. The
+  expectation is for the owning SIG to resolve the flakes and reintroduce the
+  test, or determine the tested functionality is covered via another method
+  and delete the test in question.
+- Quarantine a single test case by adding `[Flaky]` to the test name in
+  question; most CI jobs exclude these tests. This makes the most sense for
+  flakes that are merge-blocking and taking too long to troubleshoot, or
+  occurring across multiple jobs (see the sketch at the end of this document).
+  - eg: https://github.com/kubernetes/kubernetes/pull/83792
+  - eg: https://github.com/kubernetes/kubernetes/pull/86327
+- Quarantine an entire set of tests by adding `[Feature:Foo]` to the test(s) in
+  question. This will require creating jobs that focus specifically on this
+  feature. The majority of release-blocking and merge-blocking suites avoid
+  these jobs unless they're proven to be non-flaky.
 
 # Hunting Flakes
 
 We offer the following tools to aid in finding or troubleshooting flakes
+- [flakes-latest.json](http://storage.googleapis.com/k8s-metrics/flakes-latest.json)
+  - shows the top 10 flakes over the past week for all PR jobs
 - [go.k8s.io/triage] - an interactive test failure report providing filtering and
   drill-down by job name, test name, failure text for failures in the last two weeks
   - https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24 - all failures that happened in the `pull-kubernetes-e2e-gce` job
   - https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=timed%20out - all failures containing the text `timed out`
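+
+To expand on the Quarantining Flakes section above: quarantine tags are plain
+substrings of the Ginkgo test description, so adding `[Flaky]` or
+`[Feature:Foo]` is a one-line change, and CI jobs include or exclude the tests
+with `--ginkgo.skip` / `--ginkgo.focus` regexes. A minimal sketch (the
+`[sig-example]` widget tests are hypothetical):
+
+```go
+package example
+
+import "github.com/onsi/ginkgo"
+
+var _ = ginkgo.Describe("[sig-example] Widgets", func() {
+	// Adding [Flaky] quarantines just this case from jobs that skip [Flaky].
+	ginkgo.It("should reconcile widgets [Flaky]", func() {
+		// ... test body elided ...
+	})
+
+	// Adding [Feature:Widgets] keeps this case out of the default suites;
+	// it only runs in jobs that explicitly focus on [Feature:Widgets].
+	ginkgo.It("should garbage collect widgets [Feature:Widgets]", func() {
+		// ... test body elided ...
+	})
+})
+```
+
+Remember to file the tracking issue described above so the owning SIG can fix
+and reintroduce the quarantined test.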