# Blocking PR merges in the event of regression.
As mentioned in the charter, SIG Scalability has the right to block all PRs
from merging into the relevant repos. This document describes the underlying
"rules of engagement" for this process and the rationale for why it is needed.
### Rules of engagement.
The rules of engagement for blocking merges are as follows:
- Observe a scalability regression on one of the release-blocking test suites
(defined as a green-to-red transition; if the tests were already failing, we
don't have the right to declare a regression).
- Block merges of all PRs to the relevant repos in the affected branch,
declaring which repos those are and why.
- Identify the PR which caused the regression:
  - this can be done by reading code changes, bisecting (see the sketch after
  this list), debugging based on metrics and/or logs, etc.
  - we say a PR is identified as the cause when we are reasonably confident
  that it indeed caused the regression; to minimize the time during which
  merges are blocked, the mechanism does not have to be 100% understood.
- Mitigate the regression. This may mean, e.g.:
  - reverting the PR,
  - switching a feature off (preferably by default; as a last resort, only in tests),
  - fixing the problem (if it's easy and quick to fix).
- Unblock merges of all PRs to the relevant repos in the affected branch.
The exact technical mechanisms for doing so are out of scope for this document.
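As a concrete illustration of the bisection step mentioned above, the sketch
below automates `git bisect run` over a suspected commit range. It is not an
official SIG Scalability tool; the commit placeholders and the
`check-scalability.sh` script are assumptions standing in for whatever check
reproduces the regression (the script must exit 0 on pass and non-zero on
failure).

```go
// bisect.go: a minimal sketch of narrowing down a regression with
// `git bisect run`. Placeholders (commit SHAs, check script) are
// hypothetical and must be replaced with real values.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// run executes a command, streaming its output to the console.
func run(args ...string) error {
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	goodSHA := "<last-known-good-commit>" // e.g. the commit of the last green run
	badSHA := "<first-known-bad-commit>"  // e.g. the commit of the first red run
	// Hypothetical script: exits 0 when the scalability check passes,
	// non-zero when it fails (and 125 to skip an untestable commit).
	checkScript := "./check-scalability.sh"

	steps := [][]string{
		{"git", "bisect", "start", badSHA, goodSHA},
		{"git", "bisect", "run", checkScript},
		{"git", "bisect", "reset"},
	}
	for _, step := range steps {
		if err := run(step...); err != nil {
			fmt.Fprintln(os.Stderr, "bisect step failed:", err)
			os.Exit(1)
		}
	}
}
```

Note that `git bisect run` needs a reasonably deterministic check; for flaky
scalability signals, the check script typically has to retry or aggregate
several runs before reporting pass/fail.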
### Rationale
The process described above is quite drastic, but we believe it is justified
if we want Kubernetes to maintain its scalability SLOs. The reasoning is:
- reliably testing for regressions takes a lot of time:
  - key scalability e2e tests take too long to execute to be a prerequisite
  for merging all PRs; this is an inherent characteristic of testing at scale,
  - end-to-end tests are flaky (even when not at scale), requiring retries,
- we need to prevent regression pile-ups:
  - once a regression is merged, and no other action is taken, it is only
  a matter of time until another regression is merged on top of it,
  - debugging the cause of two simultaneous (piled-up) regressions is
  exponentially harder; see issue [53255](http://pr.k8s.io/53255), which
  links to past experience,
- we need to keep flakiness of merge-blocking jobs very low:
  - regarding benchmarks: several scalability issues in the past were caught
  by (costly) large-scale e2e tests, and could have been caught and fixed
  earlier, with far less human effort, if we had benchmark-like tests.
  Examples include:
    - scheduler anti-affinity affecting kube-dns,
    - kubelet network plugin increasing pod-startup latency,
    - large responses from apiserver violating gRPC MTU.
As explained in detail in an issue, not being able to maintain passing scalability
tests adversely affects:
- release quality
- release schedule
- engineer productivity