Deprecate Kubernetes on-call rotations
commit fc975ed591 (parent 55a3365757)
|
@ -40,11 +40,57 @@ and this document will cover the basic ones.
|
|||
|
||||
Sometimes users file support requests as issues; these are usually requests
|
||||
from people who need help configuring some aspect of Kubernetes. These should be
|
||||
directed to our [support structures](https://github.com/kubernetes/community/blob/master/contributors/devel/on-call-user-support.md) and then closed. Also, if the issue is clearly abandoned or in
|
||||
the wrong place, it should be closed. Keep in mind that only issue reporter,
|
||||
assignees and component organization members can close issue. If you do not
|
||||
have such privilege, just comment your findings. Otherwise, first `/assign`
|
||||
issue to yourself and then `/close`.
|
||||
directed to our support structures (see below) and then closed. Also, if the issue
is clearly abandoned or in the wrong place, it should be closed. Keep in mind that
only the issue reporter, assignees, and component organization members can close an issue.
If you do not have this privilege, just comment with your findings. Otherwise, first
`/assign` the issue to yourself and then `/close` it.
|
||||
|
||||
### Support Structures
|
||||
|
||||
Support requests should be directed to the following:
|
||||
|
||||
* [User documentation](https://kubernetes.io/docs/) and
|
||||
[troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
|
||||
|
||||
* [Stack Overflow](http://stackoverflow.com/questions/tagged/kubernetes) and
|
||||
[ServerFault](http://serverfault.com/questions/tagged/google-kubernetes)
|
||||
|
||||
* [Slack](https://kubernetes.slack.com) ([registration](http://slack.k8s.io))
|
||||
* Check out the [Slack Archive](http://kubernetes.slackarchive.io/) first.
|
||||
|
||||
* [Email/Groups](https://groups.google.com/forum/#!forum/kubernetes-users)
|
||||
|
||||
### User support response example
|
||||
|
||||
If you see support questions on kubernetes-dev@googlegroups.com, or issues asking for
support, try to redirect them to Stack Overflow. Example response:
|
||||
|
||||
```code
|
||||
Please re-post your question to [Stack Overflow]
|
||||
(http://stackoverflow.com/questions/tagged/kubernetes).
|
||||
|
||||
We are trying to consolidate the channels to which questions for help/support
|
||||
are posted so that we can improve our efficiency in responding to your requests,
|
||||
and to make it easier for you to find answers to frequently asked questions and
|
||||
how to address common use cases.
|
||||
|
||||
We regularly see messages posted in multiple forums, with the full response
|
||||
thread only in one place or, worse, spread across multiple forums. Also, the
|
||||
large volume of support issues on GitHub is making it difficult for us to use
|
||||
issues to identify real bugs.
|
||||
|
||||
Members of the Kubernetes community use Stack Overflow to field support
|
||||
requests. Before posting a new question, please search Stack Overflow for answers
|
||||
to similar questions, and also familiarize yourself with:
|
||||
|
||||
* [user documentation](http://kubernetes.io/docs/)
|
||||
* [troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
|
||||
|
||||
Again, thanks for using Kubernetes.
|
||||
|
||||
The Kubernetes Team
|
||||
```
|
||||
|
||||
## Find the right SIG(s)
|
||||
Components are divided among [Special Interest Groups (SIGs)](https://github.com/kubernetes/community/blob/master/sig-list.md). Find a proper SIG for the ownership of the issue using the bot:
|
||||
|
|
|
@ -1,50 +0,0 @@
|
|||
# Kubernetes BuildCop Workflow
|
||||
|
||||
June 2017
|
||||
|
||||
## Objective
|
||||
|
||||
This document describes the responsibilities and the workflow of a person assuming the buildcop role.
|
||||
The current buildcop can be found [here](https://storage.googleapis.com/kubernetes-jenkins/oncall.html).
|
||||
|
||||
## Prerequisites for build-copping
|
||||
|
||||
- Ensure you have admin access to [http://github.com/kubernetes/kubernetes](http://github.com/kubernetes/kubernetes)
|
||||
- Check your membership in the GitHub team: [kubernetes-build-cops](https://github.com/orgs/kubernetes/teams/kubernetes-build-cops/members).
|
||||
If you are not a member contact one of the team maintainers to get yourself added to it.
|
||||
- Test your admin access by e.g. adding a label to an issue.
|
||||
- You must communicate any concerns/actions via the **#sig-release** slack channel to ensure that
|
||||
the release team has context on the current state of the submit queue.
|
||||
- You must attend the release burndown meeting to provide an update on the current state of the submit-queue
|
||||
|
||||
## Responsibilities
|
||||
|
||||
The build-cop's primary responsibility is to ensure that automatic merges are happening at a
|
||||
**reasonable** rate. This may include performing merging of test flake PRs when the pre-submits
|
||||
are failing repeatedly. The buildcop must be familiar with the
|
||||
[queue labels](https://submit-queue.k8s.io/#/info) and apply them as necessary to critical fixes.
|
||||
The priority labels are defunct and no longer respected by the submit-queue. As of June 2017,
|
||||
the merge rate is ~30 PRs per day if there are that many PRs in the queue. The previous
|
||||
responsibilities of this role included classification of incoming issues, but that is no
|
||||
longer a part of the mandate.
|
||||
|
||||
## Workflow
|
||||
|
||||
1. Check the Prow batch dashboard: [https://prow.k8s.io/?type=batch](https://prow.k8s.io/?type=batch)
|
||||
to ensure that batch jobs are running regularly. It's okay to see occasional flakes. Do not worry
|
||||
about manually re-running individual tests, since Prow will rerun them.
|
||||
2. If there are post-submit blocking jobs (see [link](https://submit-queue.k8s.io/#/e2e)), ensure
|
||||
that those builds are green and allowing merges to occur.
|
||||
3. If several batch merges are failing, file an issue for that job and describe the possible
|
||||
causes for the failure. Debug if possible, else triage and assign to a particular SIG, and
|
||||
@-mention the maintainers. For example, see:
|
||||
[#47135](https://github.com/kubernetes/kubernetes/issues/47135)
|
||||
4. Communicate the actions to **#sig-release** via slack and ensure that the issue is being worked on.
|
||||
5. If the issue is not worked on for several hours, please escalate to the release team.
|
||||
The release team members can be found via the [features](https://github.com/kubernetes/features) repo.
|
||||
For example, the Kubernetes 1.7 release team members are listed [here](https://github.com/kubernetes/features/blob/master/release-1.7/release_team.md).
|
||||
Notify the release manager/release team members via GitHub mentions and slack.
|
||||
6. When the SIG member sends a fix, manually merge if necessary, after verifying that pre-submits pass,
|
||||
or use the 'retest-not-required' label with the appropriate 'queue/*' label to ensure merge of the
|
||||
flake fix.
|
||||
7. Issue an update to the **#sig-release** channel on the merge rate and the PR that was used to fix the queue.
|
|
@ -1,43 +0,0 @@
|
|||
## Kubernetes On-Call Rotations
|
||||
|
||||
### Kubernetes "first responder" rotations
|
||||
|
||||
Kubernetes has generated a lot of public traffic: email, pull-requests, bugs,
|
||||
etc. So much traffic that it's becoming impossible to keep up with it all! This
|
||||
is a fantastic problem to have. In order to be sure that SOMEONE, but not
|
||||
EVERYONE on the team is paying attention to public traffic, we have instituted
|
||||
two "first responder" rotations, listed below. Please read this page before
|
||||
proceeding to the pages linked below, which are specific to each rotation.
|
||||
|
||||
Please also read our [notes on OSS collaboration](collab.md), particularly the
|
||||
bits about hours. Specifically, each rotation is expected to be active primarily
|
||||
during work hours, less so off hours.
|
||||
|
||||
During regular workday work hours of your shift, your primary responsibility is
|
||||
to monitor the traffic sources specific to your rotation. You can check traffic
|
||||
in the evenings if you feel so inclined, but it is not expected to be as highly
|
||||
focused as work hours. For weekends, you should check traffic very occasionally
|
||||
(e.g. once or twice a day). Again, it is not expected to be as highly focused as
|
||||
workdays. It is assumed that over time, everyone will get weekday and weekend
|
||||
shifts, so the workload will balance out.
|
||||
|
||||
If you can not serve your shift, and you know this ahead of time, it is your
|
||||
responsibility to find someone to cover and to change the rotation. If you have
|
||||
an emergency, your responsibilities fall on the primary of the other rotation,
|
||||
who acts as your secondary. If you need help to cover all of the tasks, ask partners
with on-call rotations (e.g.,
[Red Hat](https://github.com/orgs/kubernetes/teams/rh-oncall)).
|
||||
|
||||
If you are not on duty you DO NOT need to do these things. You are free to focus
|
||||
on "real work".
|
||||
|
||||
Note that Kubernetes will occasionally enter code slush/freeze, prior to
|
||||
milestones. When it does, there might be changes in the instructions (assigning
|
||||
milestones, for instance).
|
||||
|
||||
* [Github and Build Cop Rotation](on-call-build-cop.md)
|
||||
* [User Support Rotation](on-call-user-support.md)
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
|
@ -1,83 +0,0 @@
|
|||
## Kubernetes "User Support" Rotation
|
||||
|
||||
### Traffic sources and responsibilities
|
||||
|
||||
* [Stack Overflow](http://stackoverflow.com/questions/tagged/kubernetes) and
|
||||
[ServerFault](http://serverfault.com/questions/tagged/google-kubernetes):
|
||||
Respond to any thread that has no responses and is more than 6 hours old (over
|
||||
time we will lengthen this timeout to allow community responses). If you are not
|
||||
equipped to respond, it is your job to redirect to someone who can.
|
||||
|
||||
* [Query for unanswered Kubernetes Stack Overflow questions](http://stackoverflow.com/search?tab=newest&q=%5bkubernetes%5d%20answers%3a0)
|
||||
* [Query for unanswered Kubernetes ServerFault questions](https://serverfault.com/search?tab=newest&q=%5bgoogle-kubernetes%5d%20answers%3a0)
|
||||
* Direct poorly formulated questions to [Stack Overflow's tips about how to ask](http://stackoverflow.com/help/how-to-ask)
|
||||
* Direct off-topic questions to [Stack Overflow's policy](http://stackoverflow.com/help/on-topic)
|
||||
|
||||
* [Slack](https://kubernetes.slack.com) ([registration](http://slack.k8s.io)):
|
||||
Your job is to be on Slack, watching for questions and answering or redirecting
|
||||
as needed, such as to a SIG-specific channel. Please especially watch
|
||||
`#kubernetes-users` and `#kubernetes-novice`. Also check out the
|
||||
[Slack Archive](http://kubernetes.slackarchive.io/).
|
||||
|
||||
* [Email/Groups](https://groups.google.com/forum/#!forum/kubernetes-users):
|
||||
Respond to any thread that has no responses and is more than 6 hours old (over
|
||||
time we will lengthen this timeout to allow community responses). If you are not
|
||||
equipped to respond, it is your job to redirect to someone who can.
|
||||
|
||||
* on slack: Respond to questions that
|
||||
don't get answers.
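The response rule above (reply to any thread with no responses that is more than 6 hours old) can be sketched as a small filter. This is only an illustration: the `Thread` type, its field names, and `needs_response` are hypothetical and do not correspond to any Stack Overflow, Slack, or Google Groups API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Timeout taken from the text; the document notes it may be lengthened over time.
RESPONSE_TIMEOUT = timedelta(hours=6)


@dataclass
class Thread:
    """Hypothetical record for a question thread from any traffic source."""
    title: str
    posted_at: datetime
    response_count: int


def needs_response(thread: Thread, now: datetime) -> bool:
    """True when the thread has no responses and is more than 6 hours old."""
    return thread.response_count == 0 and (now - thread.posted_at) > RESPONSE_TIMEOUT


now = datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc)
threads = [
    Thread("old, unanswered", now - timedelta(hours=7), 0),
    Thread("old, answered", now - timedelta(hours=7), 2),
    Thread("new, unanswered", now - timedelta(hours=1), 0),
]
# Only threads that are both unanswered and past the timeout are selected.
pending = [t.title for t in threads if needs_response(t, now)]
```

Keeping the timeout in one constant makes it easy to lengthen later, as the text anticipates.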
|
||||
|
||||
In general, try to direct support questions to:
|
||||
|
||||
1. Documentation, such as the [user documentation](https://kubernetes.io/docs/) and
|
||||
[troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
|
||||
|
||||
2. Stack Overflow
|
||||
|
||||
#### User support response example
|
||||
|
||||
If you see questions on kubernetes-dev@googlegroups.com, try to redirect them
|
||||
to Stack Overflow. Example response:
|
||||
|
||||
```code
|
||||
Please re-post your question to [Stack Overflow]
|
||||
(http://stackoverflow.com/questions/tagged/kubernetes).
|
||||
|
||||
We are trying to consolidate the channels to which questions for help/support
|
||||
are posted so that we can improve our efficiency in responding to your requests,
|
||||
and to make it easier for you to find answers to frequently asked questions and
|
||||
how to address common use cases.
|
||||
|
||||
We regularly see messages posted in multiple forums, with the full response
|
||||
thread only in one place or, worse, spread across multiple forums. Also, the
|
||||
large volume of support issues on GitHub is making it difficult for us to use
|
||||
issues to identify real bugs.
|
||||
|
||||
The Kubernetes team scans Stack Overflow on a regular basis, and will try to
|
||||
ensure your questions don't go unanswered.
|
||||
|
||||
Before posting a new question, please search Stack Overflow for answers to
|
||||
similar questions, and also familiarize yourself with:
|
||||
|
||||
* [user documentation](http://kubernetes.io/docs/)
|
||||
* [troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/)
|
||||
|
||||
Again, thanks for using Kubernetes.
|
||||
|
||||
The Kubernetes Team
|
||||
```
|
||||
|
||||
If you answer a question (in any of the above forums) that you think might be
|
||||
useful for someone else in the future, please send a PR or file an issue in
|
||||
[kubernetes.github.io](https://github.com/kubernetes/kubernetes.github.io).
|
||||
|
||||
### Contact information
|
||||
|
||||
[@k8s-oncall](https://github.com/k8s-oncall) will reach the
|
||||
current person on call.
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
|
@ -81,10 +81,11 @@ When a test is failing, it must be quickly escalated to the correct owner. Test
|
|||
are left to fail for days or weeks become toxic and create noise in the system health
|
||||
metrics.
|
||||
|
||||
The [build cop] is expected to ensure that the release blocking tests remain
|
||||
Each SIG is expected to ensure that the release blocking tests that belong to the SIG remain
|
||||
perpetually healthy by monitoring the test grid and escalating failures.
|
||||
|
||||
On test failures, the build cop will follow the [sig escalation](#sig-test-escalation) path.
|
||||
Failing tests that are not being addressed can be escalated by following the
|
||||
[sig escalation](#sig-test-escalation) path.
|
||||
|
||||
*Tests without a responsive owner should be assigned a new owner or disabled.*
|
||||
|
||||
|
@ -132,14 +133,11 @@ urgent than persistent failures, but still expected to have a root cause investi
|
|||
|
||||
## Broken test workflow
|
||||
|
||||
SIGs are expected to proactively monitor and maintain their tests. The build cop will also
|
||||
monitor the health of the entire project, but is intended as backup who will escalate
|
||||
failures to the owning SIGs.
|
||||
SIGs are expected to proactively monitor and maintain their tests.
|
||||
|
||||
- File an issue for the broken test so it can be referenced and discovered
|
||||
- Set the following labels: `priority/failing-test`, `sig/*`
|
||||
- Assign the issue to whoever is working on it
|
||||
- Mention the current build cop (TODO: publish this somewhere)
|
||||
- Root cause analysis of the test failure is performed by the owner
|
||||
- **Note**: The owning SIG for a test can reassign ownership of a resolution to another SIG only after getting
|
||||
approval from that SIG
|
||||
|
@ -152,13 +150,11 @@ failures to the owning SIGs.
|
|||
|
||||
## SIG test escalation
|
||||
|
||||
The build cop should monitor the overall test health of the project, and ensure ownership for any given
|
||||
test does not fall through the cracks. When the build cop observes a test failure, they should first
|
||||
search to see if an issue has been filed already, and if not (optionally file an issue and) escalate to the SIG
|
||||
escalation point. If the escalation point is unresponsive within a day, the build cop should escalate to the SIG
|
||||
googlegroup and/or slack channel, mentioning the SIG leads. If escalation through the SIG googlegroup,
|
||||
slack channel and SIG leads is unsuccessful, the build cop should escalate to SIG release through the
|
||||
googlegroup and slack - mentioning the SIG leads.
|
||||
As a Kubernetes developer, if you observe a test failure, first search to see whether an issue has already been filed,
and if not (optionally file an issue and) escalate to the SIG escalation point.
If the escalation point does not respond within a day, escalate to the SIG googlegroup and/or slack channel,
mentioning the SIG leads. If escalation through the SIG googlegroup, slack channel and SIG leads is unsuccessful,
escalate to SIG release through the googlegroup and slack, mentioning the SIG leads.
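The escalation order described above can be sketched as an ordered list with a helper that returns the next step. This is a minimal illustration, not project tooling: `ESCALATION_STEPS` and `next_escalation` are hypothetical names, and the step labels only paraphrase the text.

```python
from typing import Optional

# Escalation steps in the order the text describes them (labels are paraphrased).
ESCALATION_STEPS = [
    "SIG escalation point",
    "SIG googlegroup and/or slack channel, mentioning the SIG leads",
    "SIG release googlegroup and slack, mentioning the SIG leads",
]


def next_escalation(current: Optional[str]) -> Optional[str]:
    """Return the step that follows `current`.

    Pass None when nothing has been tried yet; the result is None once
    the last step has already been used.
    """
    if current is None:
        return ESCALATION_STEPS[0]
    index = ESCALATION_STEPS.index(current)  # raises ValueError for unknown steps
    if index + 1 < len(ESCALATION_STEPS):
        return ESCALATION_STEPS[index + 1]
    return None


first = next_escalation(None)
second = next_escalation(first)
third = next_escalation(second)
```

The text gives a one-day wait only for the first step, so the sketch deliberately leaves timing out rather than guessing at delays for the later steps.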
|
||||
|
||||
The SIG escalation points should be bootstrapped from the [community sig list].
|
||||
|
||||
|
@ -172,7 +168,6 @@ The SIG escalation points should be bootstrapped from the [community sig list].
|
|||
[community sig list]: https://github.com/kubernetes/community/blob/master/sig-list.md
|
||||
[triage tool]: https://storage.googleapis.com/k8s-gubernator/triage/index.html
|
||||
[test grid]: https://k8s-testgrid.appspot.com/
|
||||
[build cop]: https://github.com/kubernetes/community/blob/master/contributors/devel/on-call-build-cop.md
|
||||
[release-master-blocking]: https://k8s-testgrid.appspot.com/release-master-blocking#Summary
|
||||
[1.7-master-upgrade]: https://k8s-testgrid.appspot.com/1.7-master-upgrade#Summary
|
||||
[1.6-master-upgrade]: https://k8s-testgrid.appspot.com/1.6-master-upgrade#Summary
|
||||
|
|