Merge pull request #744 from pwittrock/governance
Document a process for keeping tests passing against master
This commit is contained in:
commit
d63e9f08f6
|
|
@ -0,0 +1,180 @@
|
||||||
|
# Kubernetes test sustaining engineering
|
||||||
|
|
||||||
|
This document describes how Kubernetes automated tests are maintained as part
|
||||||
|
of the development process.
|
||||||
|
|
||||||
|
## Definitions
|
||||||
|
|
||||||
|
The following definitions are for tests continuously run as part of CI.
|
||||||
|
|
||||||
|
- *test*
|
||||||
|
- *artifact*: a row in [test grid]
|
||||||
|
- a single test run as part of a test job
|
||||||
|
- maybe either an e2e test or an integration / unit test
|
||||||
|
- *test job*
|
||||||
|
- *artifact*: a tab in [test grid]
|
||||||
|
- a collection of tests that are run together in shared environment. may:
|
||||||
|
- run in a specific environment - e.g. [gce, gke, aws], [cvm, gci]
|
||||||
|
- run under specific conditions - e.g. []upgrade, version skew, soak, serial]
|
||||||
|
- test a specific component - e.g. [federation, node]
|
||||||
|
- *test infrastructure*
|
||||||
|
- not directly shown in the test grid
|
||||||
|
- libraries and infrastructure common across tests
|
||||||
|
- *test failure*
|
||||||
|
- persistently failing test runs for a given test
|
||||||
|
- *test flake*
|
||||||
|
- periodically failling test runs for a given test
|
||||||
|
|
||||||
|
## Ownership
|
||||||
|
|
||||||
|
Each test must have an escalation point (email + slack). The escalation point is responsible for
|
||||||
|
keeping the test healthy. Fixes for test failures caused by areas of ownership outside the
|
||||||
|
responsibility of the escalation point should be coordinated with other teams by the
|
||||||
|
test escalation point.
|
||||||
|
|
||||||
|
Escalation points are expected to be responsive within 24 hours, and prioritize test failure
|
||||||
|
issues over other issues.
|
||||||
|
|
||||||
|
### test
|
||||||
|
|
||||||
|
Each test must have an owning SIG or group that serves as the escalation point for flakes and failures.
|
||||||
|
The name of the owner should be present in the test name so that it is displayed in the test grid.
|
||||||
|
|
||||||
|
Owners are expected to maintain a dashboard of the tests that they own and
|
||||||
|
maintain the test health.
|
||||||
|
|
||||||
|
**Note:** e2e test owners are present in the test name
|
||||||
|
|
||||||
|
### test job
|
||||||
|
|
||||||
|
Each test job must have an owning SIG or group that is responsible for the health of the test job. The
|
||||||
|
owner may also serve as an escalation point for issues impacting a test only in that specific test job
|
||||||
|
(passing in other test jobs). e.g. If a test only fails on aws or only on gke test jobs, the test job
|
||||||
|
owner and test owner must identify the owner for resolving the failure.
|
||||||
|
|
||||||
|
Owners of test jobs are expected to maintain a dashboard of the test jobs they own and
|
||||||
|
maintain the test job health.
|
||||||
|
|
||||||
|
SIGs should update the [job/config] and mark the tests that they own.
|
||||||
|
|
||||||
|
### test infrastructure
|
||||||
|
|
||||||
|
Issues with underlying test infrastructure (e.g. prow) should be escalated to sig/testing.
|
||||||
|
|
||||||
|
## Monitoring project wide test health
|
||||||
|
|
||||||
|
Dashboards for Kubernetes release blocking test are present on the [test grid].
|
||||||
|
|
||||||
|
The following dashboards are expected to remain healthy throughout the development cycle.
|
||||||
|
|
||||||
|
- [release-master-blocking](https://k8s-testgrid.appspot.com/release-master-blocking)
|
||||||
|
- Tests run against the master branch
|
||||||
|
- [1.7-master-upgrade & 1.6-master-upgrade](https://k8s-testgrid.appspot.com/master-upgrade)
|
||||||
|
- Upgrade a cluster from 1.7 to the master branch and run tests
|
||||||
|
- Upgrade a cluster from 1.6 to the master branch and run tests
|
||||||
|
- [1.7-master-kubectl-skew](https://k8s-testgrid.appspot.com/master-kubectl-skew)
|
||||||
|
- Run tests skewing the master and kubectl by +1/-1 version
|
||||||
|
|
||||||
|
## Triaging ownership for test failures
|
||||||
|
|
||||||
|
When a test is failing, it must be quickly escalated to the correct owner. Tests that
|
||||||
|
are left to fail for days or weeks become toxic and create noise in the system health
|
||||||
|
metrics.
|
||||||
|
|
||||||
|
The [build cop] is expected to ensure that the release blocking tests remain
|
||||||
|
perpetually healthy by monitoring the test grid and escalating failures.
|
||||||
|
|
||||||
|
On test failures, the build cop will follow the [sig escalation](#sig-test-escalation) path.
|
||||||
|
|
||||||
|
*Tests without a responsive owner should be assigned a new owner or disabled.*
|
||||||
|
|
||||||
|
### test failure
|
||||||
|
|
||||||
|
A test is failing.
|
||||||
|
|
||||||
|
*Symptom*: A row in the test grid is consistently failing across multiple jobs
|
||||||
|
|
||||||
|
*How to check for symptom*: Go to the [triage tool], and
|
||||||
|
search for the failing test by name. Check to see if it is failing across
|
||||||
|
multiple jobs, or just one.
|
||||||
|
|
||||||
|
*Action*: Escalate to the owning SIG present in the test name (e.g. SIG-cli)
|
||||||
|
|
||||||
|
### test job failure
|
||||||
|
|
||||||
|
A test *job* is unhealthy causing multiple unrelated tests to fail.
|
||||||
|
|
||||||
|
*Symptom*: Multiple unrelated rows in the test grid are consistently failing in a single job,
|
||||||
|
but passing in others jobs.
|
||||||
|
|
||||||
|
*How to check for symptom*: Go to the [test grid]. Are a bunch of tests failing or just a couple? Are
|
||||||
|
those tests passing on other jobs?
|
||||||
|
|
||||||
|
*Action*: Escalate to the owning SIG for the test job.
|
||||||
|
|
||||||
|
### test failure (only on specifics job)
|
||||||
|
|
||||||
|
A test is failing, but only on specific jobs.
|
||||||
|
|
||||||
|
*Symptom*: A row in the test grid is consistently failing on a single job, but passing on other jobs.
|
||||||
|
|
||||||
|
*How to check for symptom*: Go to the [triage tool], and
|
||||||
|
search for the failing test by name. Check to see if it is failing across
|
||||||
|
multiple jobs, or just one.
|
||||||
|
|
||||||
|
*Action*: Escalate to the owning SIG present in the test name (e.g. SIG-cli). They
|
||||||
|
will coordinate a fix with the test job owner.
|
||||||
|
|
||||||
|
## Triaging ownership for test flakes
|
||||||
|
|
||||||
|
To triage ownership flakes, follow the same escalation process for failures. Flakes are considered less
|
||||||
|
urgent than persistent failures, but still expected to have a root cause investigation within 1 week.
|
||||||
|
|
||||||
|
## Broken test workflow
|
||||||
|
|
||||||
|
SIGs are expected to proactively monitor and maintain their tests. The build cop will also
|
||||||
|
monitor the health of the entire project, but is intended as backup who will escalate
|
||||||
|
failures to the owning SIGs.
|
||||||
|
|
||||||
|
- File an issue for the broken test so it can be referenced and discovered
|
||||||
|
- Set the following labels: `priority/failing-test`, `sig/*`
|
||||||
|
- Assign the issue to whoever is working on it
|
||||||
|
- Mention the current build cop (TODO: publish this somewhere)
|
||||||
|
- Root cause analysis of the test failure is performed by the owner
|
||||||
|
- **Note**: The owning SIG for a test can reassign ownership of a resolution to another SIG only after getting
|
||||||
|
approval from that SIG
|
||||||
|
- This is done by the target SIG reassigning to themselves, not the test owning SIG assigning to someone else.
|
||||||
|
- Tests failure is resolved either by fixing the underlying issue or disabling the test
|
||||||
|
- Disabling a test maybe the correct thing to do in some cases - such as upgrade tests running e2e tests for alpha
|
||||||
|
features disable in newer releases.
|
||||||
|
- SIG owner monitors the test grid to make sure the tests begin to pass
|
||||||
|
- SIG owner closes the issue
|
||||||
|
|
||||||
|
## SIG test escalation
|
||||||
|
|
||||||
|
The build cop should monitor the overall test health of the project, and ensure ownership for any given
|
||||||
|
test does not fall through the cracks. When the build cop observer a test failure, they should first
|
||||||
|
search to see if an issue has been filed already, and if not (optionally file an issue and) escalate to the SIG
|
||||||
|
escalation point. If the escalation point is unresponsive within a day, the build cop should escalate to the SIG
|
||||||
|
googlegroup and/or slack channel, mentioning the SIG leads. If escalation through the SIG googlegroup,
|
||||||
|
slack channel and SIG leads is unsuccessful, the build cop should escalate to SIG release through the
|
||||||
|
googlegroup and slack - mentioning the SIG leads.
|
||||||
|
|
||||||
|
The SIG escalation points should be bootstrapped from the [community sig list].
|
||||||
|
|
||||||
|
## SIG Recommendations
|
||||||
|
|
||||||
|
- Figure out which e2e test jobs are release blocking for your SIG.
|
||||||
|
- Develop a process for making sure the SIGs test grid remains healthy and resolving test failures.
|
||||||
|
- Consider moving the e2e tests for the SIG into their own test jobs if this would make maintaining them easier.
|
||||||
|
- Consider developing a playbook for how to resolve test failures and how do identify whether or not another SIG owns the resolution of the issue.
|
||||||
|
|
||||||
|
[community sig list]: (https://github.com/kubernetes/community/blob/master/sig-list.md)
|
||||||
|
[triage tool]: (https://storage.googleapis.com/k8s-gubernator/triage/index.html)
|
||||||
|
[test grid]: (https://k8s-testgrid.appspot.com/)
|
||||||
|
[build cop]: (https://github.com/kubernetes/community/blob/master/contributors/devel/on-call-build-cop.md)
|
||||||
|
[release-master-blocking]: (https://k8s-testgrid.appspot.com/release-master-blocking#Summary)
|
||||||
|
[1.7-master-upgrade]: (https://k8s-testgrid.appspot.com/1.7-master-upgrade#Summary)
|
||||||
|
[1.6-master-upgrade]: (https://k8s-testgrid.appspot.com/1.6-master-upgrade#Summary)
|
||||||
|
[1.7-master-kubectl-skew]: (https://k8s-testgrid.appspot.com/1.6-1.7-kubectl-skew)
|
||||||
|
[job/config]: (https://github.com/kubernetes/test-infra/blob/master/jobs/config.json)
|
||||||
Loading…
Reference in New Issue