From 10ff89c5fad124a7f31edf9abb12e6bf59c82762 Mon Sep 17 00:00:00 2001 From: Anirudh Ramanathan Date: Mon, 19 Jun 2017 11:01:33 -0700 Subject: [PATCH] Update buildcop instructions Based on https://docs.google.com/document/d/11nf_tg3_0OTHfKpJdaaPBJrCMIHHWyr5cBZ1EXDnI0o/edit?ts=59477035# --- contributors/devel/on-call-build-cop.md | 143 ++++++------------------ 1 file changed, 35 insertions(+), 108 deletions(-) diff --git a/contributors/devel/on-call-build-cop.md b/contributors/devel/on-call-build-cop.md index ab80faea7..af6803743 100644 --- a/contributors/devel/on-call-build-cop.md +++ b/contributors/devel/on-call-build-cop.md @@ -1,117 +1,44 @@ -## Kubernetes "Github and Build-cop" Rotation +# Kubernetes BuildCop Workflow -### Preqrequisites +June 2017 -* Ensure you have [write access to http://github.com/kubernetes/kubernetes](https://github.com/orgs/kubernetes/teams/kubernetes-maintainers) - * Test your admin access by e.g. adding a label to an issue. +## Objective -### Traffic sources and responsibilities +This document describes the responsibilities and the workflow of a person assuming the buildcop role. +The current buildcop can be found [here](https://storage.googleapis.com/kubernetes-jenkins/oncall.html). -* GitHub Kubernetes [issues](https://github.com/kubernetes/kubernetes/issues): -Your job is to be -the first responder to all new issues. If you are not equipped to do -this (which is fine!), it is your job to seek guidance! +## Prerequisites for build-copping - * Support issues should be closed and redirected to Stack Overflow (see example -response [here](on-call-user-support.md#user-support-response-example)). +- Ensure you have write access to [http://github.com/kubernetes/kubernetes](http://github.com/kubernetes/kubernetes) + - Test your admin access by e.g. adding a label to an issue. 
+- You must communicate any concerns/actions via the **#sig-release** Slack channel to ensure that +the release team has context on the current state of the submit queue. +- You must attend the release burndown meeting to provide an update on the current state of the submit-queue. - * All incoming issues should be tagged with a team label -(team/{api,ux,control-plane,node,cluster,csi,redhat,mesosphere,gke,release-infra,test-infra,none}); -for issues that overlap teams, you can use multiple team labels +## Responsibilities - * There is a related concept of "Github teams" which allow you to @ mention -a set of people; feel free to @ mention a Github team if you wish, but this is -not a substitute for adding a team/* label, which is required +The build-cop's primary responsibility is to ensure that automatic merges are happening at a +**reasonable** rate. This may include manually merging PRs that fix test flakes when the pre-submits +are failing repeatedly. The buildcop must be familiar with the +[queue labels](https://submit-queue.k8s.io/#/info) and apply them as necessary to critical fixes. +The priority labels are defunct and no longer respected by the submit-queue. As of June 2017, +the merge rate is ~30 PRs per day if there are that many PRs in the queue. The previous +responsibilities of this role included classification of incoming issues, but that is no +longer a part of the mandate. - * [Google teams](https://github.com/orgs/kubernetes/teams?utf8=%E2%9C%93&query=goog-) - * [Redhat teams](https://github.com/orgs/kubernetes/teams?utf8=%E2%9C%93&query=rh-) - * [SIGs](https://github.com/orgs/kubernetes/teams?utf8=%E2%9C%93&query=sig-) +## Workflow - * If the issue is reporting broken builds, broken e2e tests, or other -obvious P0 issues, label the issue with priority/P0 and assign it to someone. 
-This is the only situation in which you should add a priority/* label - * non-P0 issues do not need a reviewer assigned initially - - * Assign any issues related to Vagrant to @derekwaynecarr (and @mention him -in the issue) - - * Keep in mind that you can @ mention people in an issue to bring it to -their attention without assigning it to them. You can also @ mention github -teams, such as @kubernetes/goog-ux or @kubernetes/kubectl - - * If you need help triaging an issue, consult with (or assign it to) -@brendandburns, @thockin, @bgrant0607, @davidopp, @dchen1107, -@lavalamp (all U.S. Pacific Time) or @fgrzadkowski (Central European Time). - - * At the beginning of your shift, please add team/* labels to any issues that -have fallen through the cracks and don't have one. Likewise, be fair to the next -person in rotation: try to ensure that every issue that gets filed while you are -on duty is handled. The Github query to find issues with no team/* label is: -[here](https://github.com/kubernetes/kubernetes/issues?utf8=%E2%9C%93&q=is%3Aopen+is%3Aissue+-label%3Ateam%2Fcontrol-plane+-label%3Ateam%2Fmesosphere+-label%3Ateam%2Fredhat+-label%3Ateam%2Frelease-infra+-label%3Ateam%2Fnone+-label%3Ateam%2Fnode+-label%3Ateam%2Fcluster+-label%3Ateam%2Fux+-label%3Ateam%2Fapi+-label%3Ateam%2Ftest-infra+-label%3Ateam%2Fgke+-label%3A"team%2FCSI-API+Machinery+SIG"+-label%3Ateam%2Fhuawei+-label%3Ateam%2Fsig-aws). - -### Build-copping - -* The [merge-bot submit queue](http://submit-queue.k8s.io/) -([source](https://github.com/kubernetes/contrib/tree/master/mungegithub/mungers/submit-queue.go)) -should auto-merge all eligible PRs for you once they've passed all the relevant -checks mentioned below and all [critical e2e tests] -(https://goto.google.com/k8s-test/view/Critical%20Builds/) are passing. If the -merge-bot been disabled for some reason, or tests are failing, you might need to -do some manual merging to get things back on track. 
- -* Once a day or so, look at the [flaky test builds] -(https://goto.google.com/k8s-test/view/Flaky/); if they are timing out, clusters -are failing to start, or tests are consistently failing (instead of just -flaking), file an issue to get things back on track. - -* Jobs that are not in [critical e2e tests](https://goto.google.com/k8s-test/view/Critical%20Builds/) -or [flaky test builds](https://goto.google.com/k8s-test/view/Flaky/) are not -your responsibility to monitor. The `Test owner:` in the job description will be -automatically emailed if the job is failing. - -* If you are oncall, ensure that PRs confirming to the following -pre-requisites are being merged at a reasonable rate: - - * [Have been LGTMd](https://github.com/kubernetes/kubernetes/labels/lgtm) - * Pass Travis and Jenkins per-PR tests. - * Author has signed CLA if applicable. - - -* Although the shift schedule shows you as being scheduled Monday to Monday, - working on the weekend is neither expected nor encouraged. Enjoy your time - off. - -* When the build is broken, roll back the PRs responsible ASAP - -* If the build job itself fails, Jenkins will not try again automatically and everything will halt. You can trigger one at http://kubekins.mtv.corp.google.com/job/ci-kubernetes-build/#. Click `log in`, then click `Build Now` in the left margin. - -* When E2E tests are unstable, a "merge freeze" may be instituted. During a -merge freeze: - - * Oncall should slowly merge LGTMd changes throughout the day while monitoring -E2E to ensure stability. - - * Ideally the E2E run should be green, but some tests are flaky and can fail -randomly (not as a result of a particular change). - * If a large number of tests fail, or tests that normally pass fail, that -is an indication that one or more of the PR(s) in that build might be -problematic (and should be reverted). - * Use the Test Results Analyzer to see individual test history over time. 
- - -* Flake mitigation - - * Tests that flake (fail a small percentage of the time) need an issue filed -against them. Please read [this](flaky-tests.md#filing-issues-for-flaky-tests); -the build cop is expected to file issues for any flaky tests they encounter. - - * It's reasonable to manually merge PRs that fix a flake or otherwise mitigate it. - -### Contact information - -[@k8s-oncall](https://github.com/k8s-oncall) will reach the current person on -call. - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/on-call-build-cop.md?pixel)]() - +1. Check the Prow batch dashboard: [https://prow.k8s.io/?type=batch](https://prow.k8s.io/?type=batch) +to ensure that merges are occurring regularly. +2. If there are post-submit blocking jobs (see [link](https://submit-queue.k8s.io/#/e2e)), ensure +that those builds are green and allowing merges to occur. +3. If several batch merges are failing, file an issue for that job and describe the possible +causes for the failure. Debug if possible; otherwise triage, assign to a particular SIG, and +@-mention the maintainers. For example, see: +[https://github.com/kubernetes/kubernetes/issues/47135](https://github.com/kubernetes/kubernetes/issues/47135) +4. Communicate the actions to **#sig-release** via Slack and ensure that the issue is being worked on. + 1. If the issue is not worked on for several hours, please escalate to the release team. +5. When the SIG member sends a fix, manually merge if necessary, after verifying that pre-submits pass, +or use the 'retest-not-required' label with the appropriate 'queue/*' label to ensure that the +flake fix merges. +6. Issue an update to the **#sig-release** channel on the merge rate and the PR that was used to fix the queue.