Deprecate Kubernetes on-call rotations

2017-10-19 14:19:01 -07:00 · fc975ed591
parent 55a3365757
commit fc975ed591
5 changed files with 60 additions and 195 deletions
--- a/contributors/devel/issues.md
+++ b/contributors/devel/issues.md
@@ -40,11 +40,57 @@ and this document will cover the basic ones.

 Sometimes users ask for support requests in issues; these are usually requests
 from people who need help configuring some aspect of Kubernetes. These should be
-directed to our [support structures](https://github.com/kubernetes/community/blob/master/contributors/devel/on-call-user-support.md) and then closed. Also, if the issue is clearly abandoned or in
-the wrong place, it should be closed. Keep in mind that only issue reporter,
-assignees and component organization members can close issue. If you do not
-have such privilege, just comment your findings. Otherwise, first `/assign`
-issue to yourself and then `/close`.
+directed to our support structures (see below) and then closed. Also, if the issue
+is clearly abandoned or in the wrong place, it should be closed. Keep in mind that
+only the issue reporter, assignees, and component organization members can close an
+issue. If you do not have this privilege, just comment with your findings. Otherwise,
+first `/assign` the issue to yourself and then `/close` it.
+
+### Support Structures
+
+Support requests should be directed to the following:
+
+* [User documentation](https://kubernetes.io/docs/) and
+[troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
+
+* [Stack Overflow](http://stackoverflow.com/questions/tagged/kubernetes) and
+[ServerFault](http://serverfault.com/questions/tagged/google-kubernetes)
+
+* [Slack](https://kubernetes.slack.com) ([registration](http://slack.k8s.io))
+  * Check out the [Slack Archive](http://kubernetes.slackarchive.io/) first.
+
+* [Email/Groups](https://groups.google.com/forum/#!forum/kubernetes-users)
+
+### User support response example
+
+If you see support questions on kubernetes-dev@googlegroups.com or issues asking for
+support, try to redirect them to Stack Overflow. Example response:
+
+```code
+Please re-post your question to
+[Stack Overflow](http://stackoverflow.com/questions/tagged/kubernetes).
+
+We are trying to consolidate the channels to which questions for help/support
+are posted so that we can improve our efficiency in responding to your requests,
+and to make it easier for you to find answers to frequently asked questions and
+how to address common use cases.
+
+We regularly see messages posted in multiple forums, with the full response
+thread only in one place or, worse, spread across multiple forums. Also, the
+large volume of support issues on GitHub is making it difficult for us to use
+issues to identify real bugs.
+
+Members of the Kubernetes community use Stack Overflow to field support
+requests. Before posting a new question, please search Stack Overflow for answers 
+to similar questions, and also familiarize yourself with:
+
+  * [user documentation](http://kubernetes.io/docs/)
+  * [troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
+
+Again, thanks for using Kubernetes.
+
+The Kubernetes Team
+```

 ## Find the right SIG(s)
 Components are divided among [Special Interest Groups (SIGs)](https://github.com/kubernetes/community/blob/master/sig-list.md). Find a proper SIG for the ownership of the issue using the bot:
--- a/contributors/devel/on-call-build-cop.md
+++ b/contributors/devel/on-call-build-cop.md
@@ -1,50 +0,0 @@
-# Kubernetes BuildCop Workflow
-
-June 2017
-
-## Objective
-
-This document describes the responsibilities and the workflow of a person assuming the buildcop role. 
-The current buildcop can be found [here](https://storage.googleapis.com/kubernetes-jenkins/oncall.html).
-
-## Prerequisites for build-copping
-
- Ensure you have admin access to [http://github.com/kubernetes/kubernetes](http://github.com/kubernetes/kubernetes)
-  - Check your membership in the GitHub team: [kubernetes-build-cops](https://github.com/orgs/kubernetes/teams/kubernetes-build-cops/members). 
-  If you are not a member contact one of the team maintainers to get yourself added to it.
-  - Test your admin access by e.g. adding a label to an issue.
- You must communicate any concerns/actions via the **#sig-release** slack channel to ensure that 
-the release team has context on the current state of the submit queue.
- You must attend the release burndown meeting to provide an update on the current state of the submit-queue
-
-## Responsibilities
-
-The build-cop's primary responsibility is to ensure that automatic merges are happening at a 
-**reasonable** rate. This may include performing merging of test flake PRs when the pre-submits 
-are failing repeatedly. The buildcop must be familiar with the 
-[queue labels](https://submit-queue.k8s.io/#/info) and apply them as necessary to critical fixes. 
-The priority labels are defunct and no longer respected by the submit-queue. As of June 2017, 
-the merge rate is ~30 PRs per day if there are that many PRs in the queue. The previous 
-responsibilities of this role included classification of incoming issues, but that is no 
-longer a part of the mandate.
-
-## Workflow
-
-1. Check the Prow batch dashboard: [https://prow.k8s.io/?type=batch](https://prow.k8s.io/?type=batch) 
-to ensure that batch jobs are running regularly. It's okay to see occasional flakes. Do not worry
-about manually re-running individual tests, since Prow will rerun them.
-2. If there are post-submit blocking jobs (see [link](https://submit-queue.k8s.io/#/e2e)), ensure 
-that those builds are green and allowing merges to occur.
-3. If several batch merges are failing, file an issue for that job and describe the possible 
-causes for the failure. Debug if possible, else triage and assign to a particular SIG, and 
-@-mention the maintainers. For example, see: 
-[#47135](https://github.com/kubernetes/kubernetes/issues/47135)
-4. Communicate the actions to **#sig-release** via slack and ensure that the issue is being worked on.
-5. If the issue is not worked on for several hours, please escalate to the release team. 
-  The release team members can be found via the [features](https://github.com/kubernetes/features) repo.
-  For example, the Kubernetes 1.7 release team members are listed [here](https://github.com/kubernetes/features/blob/master/release-1.7/release_team.md).
-  Notify the release manager/release team members via GitHub mentions and slack. 
-6. When the SIG member sends a fix, manually merge if necessary, after verifying that pre-submits pass, 
-or use the 'retest-not-required' label with the appropriate 'queue/*' label to ensure merge of the 
-flake fix.
-7. Issue an update to the **#sig-release** channel on the merge rate and the PR that was used to fix the queue.
--- a/contributors/devel/on-call-rotations.md
+++ b/contributors/devel/on-call-rotations.md
@@ -1,43 +0,0 @@
-## Kubernetes On-Call Rotations
-
-### Kubernetes "first responder" rotations
-
-Kubernetes has generated a lot of public traffic: email, pull-requests, bugs,
-etc. So much traffic that it's becoming impossible to keep up with it all! This
-is a fantastic problem to have. In order to be sure that SOMEONE, but not
-EVERYONE on the team is paying attention to public traffic, we have instituted
-two "first responder" rotations, listed below. Please read this page before
-proceeding to the pages linked below, which are specific to each rotation.
-
-Please also read our [notes on OSS collaboration](collab.md), particularly the
-bits about hours. Specifically, each rotation is expected to be active primarily
-during work hours, less so off hours.
-
-During regular workday work hours of your shift, your primary responsibility is
-to monitor the traffic sources specific to your rotation. You can check traffic
-in the evenings if you feel so inclined, but it is not expected to be as highly
-focused as work hours. For weekends, you should check traffic very occasionally
-(e.g. once or twice a day). Again, it is not expected to be as highly focused as
-workdays. It is assumed that over time, everyone will get weekday and weekend
-shifts, so the workload will balance out.
-
-If you can not serve your shift, and you know this ahead of time, it is your
-responsibility to find someone to cover and to change the rotation. If you have
-an emergency, your responsibilities fall on the primary of the other rotation,
-who acts as your secondary. If you need help to cover all of the tasks, partners
-with oncall rotations (e.g.,
-[Redhat](https://github.com/orgs/kubernetes/teams/rh-oncall)).
-
-If you are not on duty you DO NOT need to do these things. You are free to focus
-on "real work".
-
-Note that Kubernetes will occasionally enter code slush/freeze, prior to
-milestones. When it does, there might be changes in the instructions (assigning
-milestones, for instance).
-
-* [Github and Build Cop Rotation](on-call-build-cop.md)
-* [User Support Rotation](on-call-user-support.md)
-
-<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/on-call-rotations.md?pixel)]()
-<!-- END MUNGE: GENERATED_ANALYTICS -->
--- a/contributors/devel/on-call-user-support.md
+++ b/contributors/devel/on-call-user-support.md
@@ -1,83 +0,0 @@
-## Kubernetes "User Support" Rotation
-
-### Traffic sources and responsibilities
-
-* [Stack Overflow](http://stackoverflow.com/questions/tagged/kubernetes) and
-[ServerFault](http://serverfault.com/questions/tagged/google-kubernetes):
-Respond to any thread that has no responses and is more than 6 hours old (over
-time we will lengthen this timeout to allow community responses). If you are not
-equipped to respond, it is your job to redirect to someone who can.
-
-  * [Query for unanswered Kubernetes Stack Overflow questions](http://stackoverflow.com/search?tab=newest&q=%5bkubernetes%5d%20answers%3a0)
-  * [Query for unanswered Kubernetes ServerFault questions](https://serverfault.com/search?tab=newest&q=%5bgoogle-kubernetes%5d%20answers%3a0)
-  * Direct poorly formulated questions to [Stack Overflow's tips about how to ask](http://stackoverflow.com/help/how-to-ask)
-  * Direct off-topic questions to [Stack Overflow's policy](http://stackoverflow.com/help/on-topic)
-
-* [Slack](https://kubernetes.slack.com) ([registration](http://slack.k8s.io)):
-Your job is to be on Slack, watching for questions and answering or redirecting
-as needed, such as to a SIG-specific channel. Please especially watch
-`#kubernetes-users` and `#kubernetes-novice`. Also check out the
-[Slack Archive](http://kubernetes.slackarchive.io/).
-
-* [Email/Groups](https://groups.google.com/forum/#!forum/kubernetes-users):
-Respond to any thread that has no responses and is more than 6 hours old (over
-time we will lengthen this timeout to allow community responses). If you are not
-equipped to respond, it is your job to redirect to someone who can.
-
-*  on slack: Respond to questions that
-don't get answers.
-
-In general, try to direct support questions to:
-
-1. Documentation, such as the [user documentation](https://kubernetes.io/docs/) and
-[troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
-
-2. Stack Overflow
-
-#### User support response example
-
-If you see questions on kubernetes-dev@googlegroups.com, try to redirect them
-to Stack Overflow. Example response:
-
-```code
-Please re-post your question to [Stack Overflow]
-(http://stackoverflow.com/questions/tagged/kubernetes).
-
-We are trying to consolidate the channels to which questions for help/support
-are posted so that we can improve our efficiency in responding to your requests,
-and to make it easier for you to find answers to frequently asked questions and
-how to address common use cases.
-
-We regularly see messages posted in multiple forums, with the full response
-thread only in one place or, worse, spread across multiple forums. Also, the
-large volume of support issues on github is making it difficult for us to use
-issues to identify real bugs.
-
-The Kubernetes team scans Stack Overflow on a regular basis, and will try to
-ensure your questions don't go unanswered.
-
-Before posting a new question, please search Stack Overflow for answers to
-similar questions, and also familiarize yourself with:
-
-  * [user documentation](http://kubernetes.io/docs/)
-  * [troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
-
-Again, thanks for using Kubernetes.
-
-The Kubernetes Team
-```
-
-If you answer a question (in any of the above forums) that you think might be
-useful for someone else in the future, please send a PR or file an issue in
-[kubernetes.github.io](https://github.com/kubernetes/kubernetes.github.io).
-
-### Contact information
-
-[@k8s-oncall](https://github.com/k8s-oncall) will reach the
-current person on call.
-
-
-
-<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/on-call-user-support.md?pixel)]()
-<!-- END MUNGE: GENERATED_ANALYTICS -->
--- a/contributors/devel/release/testing.md
+++ b/contributors/devel/release/testing.md
@@ -81,10 +81,11 @@ When a test is failing, it must be quickly escalated to the correct owner.  Test
 are left to fail for days or weeks become toxic and create noise in the system health
 metrics.

-The [build cop] is expected to ensure that the release blocking tests remain
+Each SIG is expected to ensure that the release blocking tests that belong to the SIG remain
 perpetually healthy by monitoring the test grid and escalating failures.

-On test failures, the build cop will follow the [sig escalation](#sig-test-escalation) path.
+Failing tests that are not being addressed can be escalated by following the
+[sig escalation](#sig-test-escalation) path.

 *Tests without a responsive owner should be assigned a new owner or disabled.*

@@ -132,14 +133,11 @@ urgent than persistent failures, but still expected to have a root cause investi

 ## Broken test workflow

-SIGs are expected to proactively monitor and maintain their tests.  The build cop will also
-monitor the health of the entire project, but is intended as backup who will escalate
-failures to the owning SIGs.
+SIGs are expected to proactively monitor and maintain their tests.

 - File an issue for the broken test so it can be referenced and discovered
  - Set the following labels: `priority/failing-test`, `sig/*`
  - Assign the issue to whoever is working on it
-  - Mention the current build cop (TODO: publish this somewhere)
 - Root cause analysis of the test failure is performed by the owner
 - **Note**: The owning SIG for a test can reassign ownership of a resolution to another SIG only after getting
  approval from that SIG
@@ -152,13 +150,11 @@ failures to the owning SIGs.

 ## SIG test escalation

-The build cop should monitor the overall test health of the project, and ensure ownership for any given
-test does not fall through the cracks.  When the build cop observer a test failure, they should first
-search to see if an issue has been filed already, and if not (optionally file an issue and) escalate to the SIG
-escalation point.  If the escalation point is unresponsive within a day, the build cop should escalate to the SIG
-googlegroup and/or slack channel, mentioning the SIG leads.  If escalation through the SIG googlegroup,
-slack channel and SIG leads is unsuccessful, the build cop should escalate to SIG release through the
-googlegroup and slack - mentioning the SIG leads.
+As a Kubernetes developer, if you observe a test failure, first search to see if an issue has been filed already,
+and if not (optionally file an issue and) escalate to the SIG escalation point.
+If the escalation point is unresponsive within a day, escalate to the SIG Google Group and/or Slack channel,
+mentioning the SIG leads.  If escalation through the SIG Google Group, Slack channel, and SIG leads is unsuccessful,
+escalate to SIG release through the Google Group and Slack, mentioning the SIG leads.

 The SIG escalation points should be bootstrapped from the [community sig list].

@@ -172,7 +168,6 @@ The SIG escalation points should be bootstrapped from the [community sig list].
 [community sig list]: https://github.com/kubernetes/community/blob/master/sig-list.md
 [triage tool]: https://storage.googleapis.com/k8s-gubernator/triage/index.html
 [test grid]: https://k8s-testgrid.appspot.com/
-[build cop]: https://github.com/kubernetes/community/blob/master/contributors/devel/on-call-build-cop.md
 [release-master-blocking]: https://k8s-testgrid.appspot.com/release-master-blocking#Summary
 [1.7-master-upgrade]: https://k8s-testgrid.appspot.com/1.7-master-upgrade#Summary
 [1.6-master-upgrade]: https://k8s-testgrid.appspot.com/1.6-master-upgrade#Summary
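
Reviewer note: with the build cop removed, the "SIG test escalation" text added to `testing.md` describes a three-step ladder that any developer is expected to follow. The sketch below is one way to read that ladder and is not part of the commit. The function name `next_escalation`, the `LADDER` tuple, and the `RESPONSE_WINDOW` constant are hypothetical. The document states a one-day wait only for the first step, so applying the same window to the second step is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: "unresponsive within a day" is modelled as a fixed 24-hour window,
# and the same window is reused for the second rung (the document gives no time there).
RESPONSE_WINDOW = timedelta(days=1)

# Escalation targets, in the order the testing.md text lists them.
LADDER = (
    "SIG escalation point",
    "SIG Google Group and/or Slack channel (mention the SIG leads)",
    "SIG release Google Group and Slack (mention the SIG leads)",
)


def next_escalation(level: Optional[int], contacted_at: Optional[datetime],
                    acknowledged: bool, now: datetime) -> Optional[str]:
    """Return the next target to contact, or None if no new contact is due.

    level is the index in LADDER of the last target contacted, or None if
    nothing has been escalated yet. contacted_at is when that contact was made.
    """
    if level is None:
        # Nothing escalated yet: after searching for (and optionally filing)
        # an issue, start at the first rung.
        return LADDER[0]
    if acknowledged:
        return None  # someone responded, so no further escalation is needed
    if now - contacted_at < RESPONSE_WINDOW:
        return None  # still inside the waiting window
    if level + 1 < len(LADDER):
        return LADDER[level + 1]
    return None  # top of the ladder; the document defines no further step
```

For example, calling `next_escalation(0, contacted_at, False, now)` with `now` at least a day after `contacted_at` returns the second entry of `LADDER`, while any call with `acknowledged=True` returns `None`.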