56 lines
2.8 KiB
Markdown
56 lines
2.8 KiB
Markdown
# Blocking PR merges in the event of regression.
|
|
|
|
As mentioned in the charter, SIG scalability has a right to block all PRs
|
|
from merging into the relevant repos. This document describes the underlying
|
|
"Rules of engagement" of this process and the rationale why this is needed.
|
|
|
|
### Rules of engagement.
|
|
The rules of engagement for blocking merges are as following:
|
|
|
|
- Observe as scalability regression on one of release-blocking test suites
|
|
(defined as green to red transition - if tests were already failing, we
|
|
don't have a right to declare a regression).
|
|
- Block merges of all PRs to the relevant repos in the affected branch,
|
|
declaring which repos those are and why.
|
|
- Identify the PR which caused the regression:
|
|
- this can be done by reading code changes, bisecting, debugging based on
|
|
metrics and/or logs, etc.
|
|
- we say a PR is identified as the cause when we are reasonably confident
|
|
that it indeed caused a regression, even if the mechanism is not 100%
|
|
understood to minimize the time when merges are blocked
|
|
- Mitigate the regression. This may mean e.g.:
|
|
- reverting the PR
|
|
- switching a feature off (preferably by default, as last resort only in tests)
|
|
- fixing the problem (if it's easy and quick to fix)
|
|
- Unblock merges of all PRs to the relevant repos in the affected branch.
|
|
|
|
The exact technical mechanisms for it are out of scope for this document.
|
|
|
|
### Rationale
|
|
The process described above is quite drastic, but we believe it is justified
|
|
if we want kubernetes to maintain scalability SLOs. The reasoning is:
|
|
- reliably testing for regressions takes a lot of time:
|
|
- key scalability e2e tests take too long to execute to be a prerequisite
|
|
for merging all PRs, this is an inherent characteristic of testing at scale,
|
|
- end-to-end tests are flaky (even when not at scale) requiring retries,
|
|
- we need to prevent regression pile-ups:
|
|
- once a regression is merged, and no other action is taken, it is only
|
|
a matter of time until another regression is merged on top of it,
|
|
- debugging the cause of two simultaneous (piled-up) regressions is
|
|
exponentially harder, see issue [53255](http://pr.k8s.io/53255) which
|
|
links to past experience
|
|
- we need to keep flakiness of merge-blocking jobs very low:
|
|
- regarding benchmarks, there were several scalability issues in the past
|
|
caught by (costly) large-scale e2e tests, which could have been caught and
|
|
fixed earlier and with far less human effort if we had benchmark-like
|
|
tests. Examples include:
|
|
- scheduler anti-affinity affecting kube-dns,
|
|
- kubelet network plugin increasing pod-startup latency,
|
|
- large responses from apiserver violating gRPC MTU.
|
|
|
|
As explained in detail in an issue, not being able to maintain passing scalability
|
|
tests adversely affect:
|
|
- release quality
|
|
- release schedule
|
|
- engineer productivity
|