From 9f473f6e971593535f69547ccb933ce08033496e Mon Sep 17 00:00:00 2001
From: Phillip Wittrock
Date: Tue, 20 Jun 2017 13:04:19 -0700
Subject: [PATCH] Document a process for keeping tests passing against master

---
 contributors/devel/release/testing.md | 180 ++++++++++++++++++++++++++
 1 file changed, 180 insertions(+)
 create mode 100644 contributors/devel/release/testing.md

diff --git a/contributors/devel/release/testing.md b/contributors/devel/release/testing.md
new file mode 100644
index 000000000..8e8b6d948
--- /dev/null
+++ b/contributors/devel/release/testing.md
@@ -0,0 +1,180 @@
+# Kubernetes test sustaining engineering
+
+This document describes how Kubernetes automated tests are maintained as part
+of the development process.
+
+## Definitions
+
+The following definitions are for tests continuously run as part of CI.
+
+- *test*
+  - *artifact*: a row in [test grid]
+  - a single test run as part of a test job
+  - may be either an e2e test or an integration / unit test
+- *test job*
+  - *artifact*: a tab in [test grid]
+  - a collection of tests that are run together in a shared environment. A test job may:
+    - run in a specific environment - e.g. [gce, gke, aws], [cvm, gci]
+    - run under specific conditions - e.g. [upgrade, version skew, soak, serial]
+    - test a specific component - e.g. [federation, node]
+- *test infrastructure*
+  - not directly shown in the test grid
+  - libraries and infrastructure common across tests
+- *test failure*
+  - persistently failing test runs for a given test
+- *test flake*
+  - periodically failing test runs for a given test
+
+## Ownership
+
+Each test must have an escalation point (email + slack). The escalation point is responsible for
+keeping the test healthy. When a failure is caused by an area of ownership outside the escalation
+point's responsibility, the escalation point is still responsible for coordinating the fix with
+the owning team.
+
+Escalation points are expected to be responsive within 24 hours, and to prioritize test failure
+issues over other issues.
+
+### test
+
+Each test must have an owning SIG or group that serves as the escalation point for flakes and failures.
+The name of the owner should be present in the test name so that it is displayed in the test grid.
+
+Owners are expected to maintain a dashboard of the tests that they own and
+to keep those tests healthy.
+
+**Note:** e2e test owners are present in the test name (see the sketch below).
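+
+To make ownership visible on every testgrid row, the owning SIG is embedded directly in the
+Ginkgo test name. The following is a minimal, hypothetical sketch of that convention - the
+package, test name, and test body are invented for illustration and are not taken from the
+repository:
+
+```go
+package e2e
+
+import (
+	. "github.com/onsi/ginkgo"
+)
+
+// The "[sig-cli]" tag in the Describe string becomes part of the full test
+// name, so each testgrid row for this test identifies SIG CLI as its owner
+// and the build cop knows where to escalate a failure.
+var _ = Describe("[sig-cli] Kubectl client", func() {
+	It("should create resources from a manifest", func() {
+		// Test body elided; only the naming convention matters here.
+	})
+})
+```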
+
+### test job
+
+Each test job must have an owning SIG or group that is responsible for the health of the test job. The
+owner may also serve as an escalation point for issues impacting a test only in that specific test job
+(while it passes in other test jobs). For example, if a test fails only on aws or only on gke test jobs,
+the test job owner and the test owner must together identify who owns resolving the failure.
+
+Owners of test jobs are expected to maintain a dashboard of the test jobs they own and
+to keep those test jobs healthy.
+
+SIGs should update the [job/config] and mark the tests that they own.
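+
+As a purely hypothetical sketch of what marking ownership might look like - the actual schema of
+[job/config] is defined in kubernetes/test-infra and may differ, and the `sig-owner` and `contact`
+fields below are invented for illustration:
+
+```json
+{
+  "_comment": "hypothetical sketch only, not the real jobs/config.json schema",
+  "ci-kubernetes-e2e-gce": {
+    "sig-owner": "sig-cli",
+    "contact": "sig-cli@googlegroups.com"
+  }
+}
+```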
+
+### test infrastructure
+
+Issues with underlying test infrastructure (e.g. prow) should be escalated to sig/testing.
+
+## Monitoring project-wide test health
+
+Dashboards for Kubernetes release blocking tests are present on the [test grid].
+
+The following dashboards are expected to remain healthy throughout the development cycle.
+
+- [release-master-blocking](https://k8s-testgrid.appspot.com/release-master-blocking)
+  - Tests run against the master branch
+- [1.7-master-upgrade & 1.6-master-upgrade](https://k8s-testgrid.appspot.com/master-upgrade)
+  - Upgrade a cluster from 1.7 to the master branch and run tests
+  - Upgrade a cluster from 1.6 to the master branch and run tests
+- [1.7-master-kubectl-skew](https://k8s-testgrid.appspot.com/master-kubectl-skew)
+  - Run tests with kubectl skewed by +1/-1 version relative to master
+
+## Triaging ownership for test failures
+
+When a test is failing, it must be quickly escalated to the correct owner. Tests that
+are left to fail for days or weeks become toxic and create noise in the system health
+metrics.
+
+The [build cop] is expected to ensure that the release blocking tests remain
+perpetually healthy by monitoring the test grid and escalating failures.
+
+On test failures, the build cop will follow the [sig escalation](#sig-test-escalation) path.
+
+*Tests without a responsive owner should be assigned a new owner or disabled.*
+
+### test failure
+
+A test is failing.
+
+*Symptom*: A row in the test grid is consistently failing across multiple jobs.
+
+*How to check for symptom*: Go to the [triage tool], and
+search for the failing test by name. Check to see if it is failing across
+multiple jobs, or just one.
+
+*Action*: Escalate to the owning SIG present in the test name (e.g. SIG-cli).
+
+### test job failure
+
+A test *job* is unhealthy, causing multiple unrelated tests to fail.
+
+*Symptom*: Multiple unrelated rows in the test grid are consistently failing in a single job,
+but passing in other jobs.
+
+*How to check for symptom*: Go to the [test grid]. Are many tests failing, or just a couple? Are
+those tests passing on other jobs?
+
+*Action*: Escalate to the owning SIG for the test job.
+
+### test failure (only on specific jobs)
+
+A test is failing, but only on specific jobs.
+
+*Symptom*: A row in the test grid is consistently failing on a single job, but passing on other jobs.
+
+*How to check for symptom*: Go to the [triage tool], and
+search for the failing test by name. Check to see if it is failing across
+multiple jobs, or just one.
+
+*Action*: Escalate to the owning SIG present in the test name (e.g. SIG-cli). They
+will coordinate a fix with the test job owner.
+
+## Triaging ownership for test flakes
+
+To triage ownership of flakes, follow the same escalation process as for failures. Flakes are
+considered less urgent than persistent failures, but are still expected to receive a root cause
+investigation within one week.
+
+## Broken test workflow
+
+SIGs are expected to proactively monitor and maintain their tests. The build cop will also
+monitor the health of the entire project, but is intended as a backup who escalates
+failures to the owning SIGs.
+
+- File an issue for the broken test so it can be referenced and discovered
+  - Set the following labels: `priority/failing-test`, `sig/*`
+  - Assign the issue to whoever is working on it
+  - Mention the current build cop (TODO: publish this somewhere)
+- Root cause analysis of the test failure is performed by the owner
+- **Note**: The owning SIG for a test can reassign ownership of a resolution to another SIG only after getting
+  approval from that SIG
+  - This is done by the target SIG reassigning the issue to themselves, not by the test owning SIG
+    assigning it to someone else.
+- The test failure is resolved, either by fixing the underlying issue or by disabling the test
+  - Disabling a test may be the correct thing to do in some cases - such as upgrade jobs running e2e
+    tests for alpha features that are disabled in newer releases.
+- The SIG owner monitors the test grid to make sure the tests begin to pass
+- The SIG owner closes the issue
+
+## SIG test escalation
+
+The build cop should monitor the overall test health of the project, and ensure that ownership for any given
+test does not fall through the cracks. When the build cop observes a test failure, they should first
+search to see whether an issue has already been filed, and if not, (optionally file an issue and) escalate to the SIG
+escalation point. If the escalation point is unresponsive within a day, the build cop should escalate to the SIG
+googlegroup and/or slack channel, mentioning the SIG leads. If escalation through the SIG googlegroup,
+slack channel and SIG leads is unsuccessful, the build cop should escalate to SIG release through the
+googlegroup and slack - mentioning the SIG leads.
+
+The SIG escalation points should be bootstrapped from the [community sig list].
+
+## SIG Recommendations
+
+- Figure out which e2e test jobs are release blocking for your SIG.
+- Develop a process for making sure the SIG's test grid remains healthy and test failures get resolved.
+- Consider moving the SIG's e2e tests into their own test jobs if this would make them easier to maintain.
+- Consider developing a playbook for how to resolve test failures and how to identify whether another
+  SIG owns the resolution of an issue.
+
+[community sig list]: https://github.com/kubernetes/community/blob/master/sig-list.md
+[triage tool]: https://storage.googleapis.com/k8s-gubernator/triage/index.html
+[test grid]: https://k8s-testgrid.appspot.com/
+[build cop]: https://github.com/kubernetes/community/blob/master/contributors/devel/on-call-build-cop.md
+[release-master-blocking]: https://k8s-testgrid.appspot.com/release-master-blocking#Summary
+[1.7-master-upgrade]: https://k8s-testgrid.appspot.com/1.7-master-upgrade#Summary
+[1.6-master-upgrade]: https://k8s-testgrid.appspot.com/1.6-master-upgrade#Summary
+[1.7-master-kubectl-skew]: https://k8s-testgrid.appspot.com/1.6-1.7-kubectl-skew
+[job/config]: https://github.com/kubernetes/test-infra/blob/master/jobs/config.json