Deprecate Kubernetes on-call rotations

saadali 2017-10-19 14:19:01 -07:00
parent 55a3365757
commit fc975ed591
5 changed files with 60 additions and 195 deletions

@@ -40,11 +40,57 @@ and this document will cover the basic ones.
Sometimes users ask for support requests in issues; these are usually requests
from people who need help configuring some aspect of Kubernetes. These should be
directed to our [support structures](https://github.com/kubernetes/community/blob/master/contributors/devel/on-call-user-support.md) and then closed. Also, if the issue is clearly abandoned or in
the wrong place, it should be closed. Keep in mind that only issue reporter,
assignees and component organization members can close issue. If you do not
have such privilege, just comment your findings. Otherwise, first `/assign`
issue to yourself and then `/close`.
directed to our support structures (see below) and then closed. Also, if the issue
is clearly abandoned or in the wrong place, it should be closed. Keep in mind that
only the issue reporter, assignees, and component organization members can close an
issue. If you do not have that privilege, just comment with your findings. Otherwise,
first `/assign` the issue to yourself and then `/close` it.
### Support Structures
Support requests should be directed to the following:
* [User documentation](https://kubernetes.io/docs/) and
[troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
* [Stack Overflow](http://stackoverflow.com/questions/tagged/kubernetes) and
[ServerFault](http://serverfault.com/questions/tagged/google-kubernetes)
* [Slack](https://kubernetes.slack.com) ([registration](http://slack.k8s.io))
* Check out the [Slack Archive](http://kubernetes.slackarchive.io/) first.
* [Email/Groups](https://groups.google.com/forum/#!forum/kubernetes-users)
### User support response example
If you see support questions on kubernetes-dev@googlegroups.com or issues asking for
support, try to redirect them to Stack Overflow. Example response:
```code
Please re-post your question to [Stack Overflow]
(http://stackoverflow.com/questions/tagged/kubernetes).
We are trying to consolidate the channels to which questions for help/support
are posted so that we can improve our efficiency in responding to your requests,
and to make it easier for you to find answers to frequently asked questions and
how to address common use cases.
We regularly see messages posted in multiple forums, with the full response
thread only in one place or, worse, spread across multiple forums. Also, the
large volume of support issues on github is making it difficult for us to use
issues to identify real bugs.
Members of the Kubernetes community use Stack Overflow to field support
requests. Before posting a new question, please search Stack Overflow for answers
to similar questions, and also familiarize yourself with:
* [user documentation](http://kubernetes.io/docs/)
* [troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
Again, thanks for using Kubernetes.
The Kubernetes Team
```
## Find the right SIG(s)
Components are divided among [Special Interest Groups (SIGs)](https://github.com/kubernetes/community/blob/master/sig-list.md). Find a proper SIG for the ownership of the issue using the bot:

@@ -1,50 +0,0 @@
# Kubernetes BuildCop Workflow
June 2017
## Objective
This document describes the responsibilities and the workflow of a person assuming the buildcop role.
The current buildcop can be found [here](https://storage.googleapis.com/kubernetes-jenkins/oncall.html).
## Prerequisites for build-copping
- Ensure you have admin access to [http://github.com/kubernetes/kubernetes](http://github.com/kubernetes/kubernetes)
- Check your membership in the GitHub team: [kubernetes-build-cops](https://github.com/orgs/kubernetes/teams/kubernetes-build-cops/members).
If you are not a member contact one of the team maintainers to get yourself added to it.
- Test your admin access by e.g. adding a label to an issue.
- You must communicate any concerns/actions via the **#sig-release** slack channel to ensure that
the release team has context on the current state of the submit queue.
- You must attend the release burndown meeting to provide an update on the current state of the submit-queue
## Responsibilities
The build-cop's primary responsibility is to ensure that automatic merges are happening at a
**reasonable** rate. This may include performing merging of test flake PRs when the pre-submits
are failing repeatedly. The buildcop must be familiar with the
[queue labels](https://submit-queue.k8s.io/#/info) and apply them as necessary to critical fixes.
The priority labels are defunct and no longer respected by the submit-queue. As of June 2017,
the merge rate is ~30 PRs per day if there are that many PRs in the queue. The previous
responsibilities of this role included classification of incoming issues, but that is no
longer a part of the mandate.
## Workflow
1. Check the Prow batch dashboard: [https://prow.k8s.io/?type=batch](https://prow.k8s.io/?type=batch)
to ensure that batch jobs are running regularly. It's okay to see occasional flakes. Do not worry
about manually re-running individual tests, since Prow will rerun them.
2. If there are post-submit blocking jobs (see [link](https://submit-queue.k8s.io/#/e2e)), ensure
that those builds are green and allowing merges to occur.
3. If several batch merges are failing, file an issue for that job and describe the possible
causes for the failure. Debug if possible, else triage and assign to a particular SIG, and
@-mention the maintainers. For example, see:
[#47135](https://github.com/kubernetes/kubernetes/issues/47135)
4. Communicate the actions to **#sig-release** via slack and ensure that the issue is being worked on.
5. If the issue is not worked on for several hours, please escalate to the release team.
The release team members can be found via the [features](https://github.com/kubernetes/features) repo.
For example, the Kubernetes 1.7 release team members are listed [here](https://github.com/kubernetes/features/blob/master/release-1.7/release_team.md).
Notify the release manager/release team members via GitHub mentions and slack.
6. When the SIG member sends a fix, manually merge if necessary, after verifying that pre-submits pass,
or use the 'retest-not-required' label with the appropriate 'queue/*' label to ensure merge of the
flake fix.
7. Issue an update to the **#sig-release** channel on the merge rate and the PR that was used to fix the queue.

@@ -1,43 +0,0 @@
## Kubernetes On-Call Rotations
### Kubernetes "first responder" rotations
Kubernetes has generated a lot of public traffic: email, pull-requests, bugs,
etc. So much traffic that it's becoming impossible to keep up with it all! This
is a fantastic problem to have. In order to be sure that SOMEONE, but not
EVERYONE on the team is paying attention to public traffic, we have instituted
two "first responder" rotations, listed below. Please read this page before
proceeding to the pages linked below, which are specific to each rotation.
Please also read our [notes on OSS collaboration](collab.md), particularly the
bits about hours. Specifically, each rotation is expected to be active primarily
during work hours, less so off hours.
During regular workday work hours of your shift, your primary responsibility is
to monitor the traffic sources specific to your rotation. You can check traffic
in the evenings if you feel so inclined, but it is not expected to be as highly
focused as work hours. For weekends, you should check traffic very occasionally
(e.g. once or twice a day). Again, it is not expected to be as highly focused as
workdays. It is assumed that over time, everyone will get weekday and weekend
shifts, so the workload will balance out.
If you can not serve your shift, and you know this ahead of time, it is your
responsibility to find someone to cover and to change the rotation. If you have
an emergency, your responsibilities fall on the primary of the other rotation,
who acts as your secondary. If you need help to cover all of the tasks, partners
with oncall rotations (e.g.,
[Redhat](https://github.com/orgs/kubernetes/teams/rh-oncall)).
If you are not on duty you DO NOT need to do these things. You are free to focus
on "real work".
Note that Kubernetes will occasionally enter code slush/freeze, prior to
milestones. When it does, there might be changes in the instructions (assigning
milestones, for instance).
* [Github and Build Cop Rotation](on-call-build-cop.md)
* [User Support Rotation](on-call-user-support.md)
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/on-call-rotations.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

@@ -1,83 +0,0 @@
## Kubernetes "User Support" Rotation
### Traffic sources and responsibilities
* [Stack Overflow](http://stackoverflow.com/questions/tagged/kubernetes) and
[ServerFault](http://serverfault.com/questions/tagged/google-kubernetes):
Respond to any thread that has no responses and is more than 6 hours old (over
time we will lengthen this timeout to allow community responses). If you are not
equipped to respond, it is your job to redirect to someone who can.
* [Query for unanswered Kubernetes Stack Overflow questions](http://stackoverflow.com/search?tab=newest&q=%5bkubernetes%5d%20answers%3a0)
* [Query for unanswered Kubernetes ServerFault questions](https://serverfault.com/search?tab=newest&q=%5bgoogle-kubernetes%5d%20answers%3a0)
* Direct poorly formulated questions to [Stack Overflow's tips about how to ask](http://stackoverflow.com/help/how-to-ask)
* Direct off-topic questions to [Stack Overflow's policy](http://stackoverflow.com/help/on-topic)
* [Slack](https://kubernetes.slack.com) ([registration](http://slack.k8s.io)):
Your job is to be on Slack, watching for questions and answering or redirecting
as needed, such as to a SIG-specific channel. Please especially watch
`#kubernetes-users` and `#kubernetes-novice`. Also check out the
[Slack Archive](http://kubernetes.slackarchive.io/).
* [Email/Groups](https://groups.google.com/forum/#!forum/kubernetes-users):
Respond to any thread that has no responses and is more than 6 hours old (over
time we will lengthen this timeout to allow community responses). If you are not
equipped to respond, it is your job to redirect to someone who can.
* on slack: Respond to questions that
don't get answers.
In general, try to direct support questions to:
1. Documentation, such as the [user documentation](https://kubernetes.io/docs/) and
[troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
2. Stack Overflow
#### User support response example
If you see questions on kubernetes-dev@googlegroups.com, try to redirect them
to Stack Overflow. Example response:
```code
Please re-post your question to [Stack Overflow]
(http://stackoverflow.com/questions/tagged/kubernetes).
We are trying to consolidate the channels to which questions for help/support
are posted so that we can improve our efficiency in responding to your requests,
and to make it easier for you to find answers to frequently asked questions and
how to address common use cases.
We regularly see messages posted in multiple forums, with the full response
thread only in one place or, worse, spread across multiple forums. Also, the
large volume of support issues on github is making it difficult for us to use
issues to identify real bugs.
The Kubernetes team scans Stack Overflow on a regular basis, and will try to
ensure your questions don't go unanswered.
Before posting a new question, please search Stack Overflow for answers to
similar questions, and also familiarize yourself with:
* [user documentation](http://kubernetes.io/docs/)
* [troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
Again, thanks for using Kubernetes.
The Kubernetes Team
```
If you answer a question (in any of the above forums) that you think might be
useful for someone else in the future, please send a PR or file an issue in
[kubernetes.github.io](https://github.com/kubernetes/kubernetes.github.io).
### Contact information
[@k8s-oncall](https://github.com/k8s-oncall) will reach the
current person on call.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/on-call-user-support.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

@@ -81,10 +81,11 @@ When a test is failing, it must be quickly escalated to the correct owner. Test
are left to fail for days or weeks become toxic and create noise in the system health
metrics.
The [build cop] is expected to ensure that the release blocking tests remain
Each SIG is expected to ensure that the release blocking tests that belong to the SIG remain
perpetually healthy by monitoring the test grid and escalating failures.
On test failures, the build cop will follow the [sig escalation](#sig-test-escalation) path.
Failing tests that are not being addressed can be escalated by following the
[sig escalation](#sig-test-escalation) path.
*Tests without a responsive owner should be assigned a new owner or disabled.*
@@ -132,14 +133,11 @@ urgent than persistent failures, but still expected to have a root cause investi
## Broken test workflow
SIGs are expected to proactively monitor and maintain their tests. The build cop will also
monitor the health of the entire project, but is intended as backup who will escalate
failures to the owning SIGs.
SIGs are expected to proactively monitor and maintain their tests.
- File an issue for the broken test so it can be referenced and discovered
- Set the following labels: `priority/failing-test`, `sig/*`
- Assign the issue to whoever is working on it
- Mention the current build cop (TODO: publish this somewhere)
- Root cause analysis of the test failure is performed by the owner
- **Note**: The owning SIG for a test can reassign ownership of a resolution to another SIG only after getting
approval from that SIG
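
As a rough illustration of the filing steps above, the sketch below builds the fields for a broken-test issue. The helper name `failing_test_issue`, the title format, and the dictionary keys are assumptions made for this example; only the `priority/failing-test` and `sig/*` labels and the assignment step come from the workflow itself.

```python
from typing import Dict, List


def failing_test_issue(test_name: str, sig: str, assignee: str) -> Dict[str, object]:
    """Build the fields for a broken-test issue (hypothetical helper)."""
    # Labels named in the workflow: priority/failing-test plus the owning SIG.
    labels: List[str] = ["priority/failing-test", f"sig/{sig}"]
    return {
        "title": f"Failing test: {test_name}",  # title format is an assumption
        "labels": labels,
        "assignees": [assignee],  # whoever is working on the failure
    }


issue = failing_test_issue("TestExample", "node", "some-contributor")
print(issue["labels"])
```

Root cause analysis still falls to the owner; the sketch only covers the bookkeeping that makes the failure discoverable.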
@@ -152,13 +150,11 @@ failures to the owning SIGs.
## SIG test escalation
The build cop should monitor the overall test health of the project, and ensure ownership for any given
test does not fall through the cracks. When the build cop observer a test failure, they should first
search to see if an issue has been filed already, and if not (optionally file an issue and) escalate to the SIG
escalation point. If the escalation point is unresponsive within a day, the build cop should escalate to the SIG
googlegroup and/or slack channel, mentioning the SIG leads. If escalation through the SIG googlegroup,
slack channel and SIG leads is unsuccessful, the build cop should escalate to SIG release through the
googlegroup and slack - mentioning the SIG leads.
As a Kubernetes developer, if you observe a test failure, first search to see whether an issue has already been filed,
and if not (optionally file an issue and) escalate to the SIG escalation point.
If the escalation point is unresponsive within a day, escalate to the SIG googlegroup and/or slack channel,
mentioning the SIG leads. If escalation through the SIG googlegroup, slack channel and SIG leads is unsuccessful,
escalate to SIG release through the googlegroup and slack, mentioning the SIG leads.
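
The escalation path can be read as a three-step ladder. The following is a minimal sketch of that reading, not tooling that exists in the project: `ESCALATION_LADDER`, `RESPONSE_WINDOW_HOURS`, and `next_escalation` are hypothetical names, and applying the one-day window to every step (the text states it only for the first) is an assumption.

```python
from typing import Optional

# Ordered escalation ladder, paraphrased from the escalation path.
ESCALATION_LADDER = [
    "SIG escalation point",
    "SIG googlegroup and/or slack channel, mentioning the SIG leads",
    "SIG release googlegroup and slack, mentioning the SIG leads",
]

# "Within a day" comes from the text; reusing it for later steps is an assumption.
RESPONSE_WINDOW_HOURS = 24


def next_escalation(current_step: Optional[int], hours_waiting: float) -> Optional[int]:
    """Return the ladder index to escalate to now, or None to keep waiting."""
    if current_step is None:
        return 0  # nothing escalated yet: start with the SIG escalation point
    if hours_waiting < RESPONSE_WINDOW_HOURS:
        return None  # still inside the response window
    if current_step + 1 < len(ESCALATION_LADDER):
        return current_step + 1
    return None  # already at the top of the ladder


step = next_escalation(0, 30)
if step is not None:
    print(ESCALATION_LADDER[step])
```

Keeping the ladder as data makes the order explicit and easy to adjust if a SIG documents a different escalation point.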
The SIG escalation points should be bootstrapped from the [community sig list].
@@ -172,7 +168,6 @@ The SIG escalation points should be bootstrapped from the [community sig list].
[community sig list]: https://github.com/kubernetes/community/blob/master/sig-list.md
[triage tool]: https://storage.googleapis.com/k8s-gubernator/triage/index.html
[test grid]: https://k8s-testgrid.appspot.com/
[build cop]: https://github.com/kubernetes/community/blob/master/contributors/devel/on-call-build-cop.md
[release-master-blocking]: https://k8s-testgrid.appspot.com/release-master-blocking#Summary
[1.7-master-upgrade]: https://k8s-testgrid.appspot.com/1.7-master-upgrade#Summary
[1.6-master-upgrade]: https://k8s-testgrid.appspot.com/1.6-master-upgrade#Summary