Adding kubeadm 1.6 postmortem.
This document was converted to markdown after review of the original Google doc: https://docs.google.com/document/d/1o-wJ39O8eqBxOCHgu0MGmgH4EdojzI3WlrsV3TVwhe0/edit?usp=sharing
This commit is contained in:
parent
2fd0199d43
commit
cdb06ee3f7
|
@ -0,0 +1,269 @@
|
|||
Kubernetes Postmortem: kubeadm 1.6.0 release
|
||||
============================================
|
||||
|
||||
**Incident Date:** 2017-03-28
|
||||
|
||||
**Owners:** Jacob Beacham (@pipejakob)
|
||||
|
||||
**Collaborators:** Joe Beda (@jbeda), Mike Danese (@mikedanese), Robert Bailey
|
||||
(@roberthbailey)
|
||||
|
||||
**Status:** \[draft | pending feedback | **final**\]
|
||||
|
||||
**Summary:** kubeadm 1.6.0 consistently hangs while trying to initialize new
|
||||
clusters. A fix required creating the 1.6.1 patch release six days after 1.6.0.
|
||||
|
||||
**Impact:** Initialization of a new 1.6.0 master using kubeadm.
|
||||
|
||||
**Root Causes:** kubelet’s behavior was changed to report NotReady instead of
|
||||
Ready when CNI was unconfigured
|
||||
([\#43474](https://github.com/kubernetes/kubernetes/pull/43474)), which caused
|
||||
kubeadm to hang indefinitely on initialization while waiting for the master node
|
||||
to become Ready (and then schedule a dummy deployment) in order to validate the
|
||||
control plane’s health, which was intended to happen before a CNI provider gets
|
||||
added.
|
||||
|
||||
**Resolution:** kubeadm initialization now only waits for the master node to
|
||||
register with the API server, but does not require it to be Ready, and does not
|
||||
attempt a dummy deployment to validate the control plane’s health
|
||||
([\#43835](https://github.com/kubernetes/kubernetes/pull/43835)). This behavior
|
||||
is being revisited for the 1.7 release.
|
||||
|
||||
**Detection:** A customer filed an issue against kubeadm after trying to
|
||||
initialize a new cluster with the 1.6.0 release
|
||||
([\#212](https://github.com/kubernetes/kubeadm/issues/212)).
|
||||
|
||||
**Lessons Learned:**
|
||||
|
||||
**What went well**
|
||||
|
||||
- The bug was discovered quickly after the release.
|
||||
|
||||
- Once the bug was discovered, a solution was ready within a day, and the patch
|
||||
release was available five days after that (on a Monday, so the weekend
|
||||
accounted for some of the gap).
|
||||
|
||||
**What went wrong**
|
||||
|
||||
- The 1.6.0 release of kubeadm never passed end-to-end testing.
|
||||
|
||||
- End-to-end tests only existed against the master branch instead of the
|
||||
release-1.6 branch.
|
||||
|
||||
- Conformance testing of kubeadm requires a functioning CNI provider, but due
|
||||
to changes in 1.6.0 clusters and CNI itself, previous kubeadm-endorsed CNI
|
||||
providers required updating to reflect the new master taint, RBAC being
|
||||
enabled, the master’s insecure port being disabled, and to tolerate deletion
|
||||
of unknown pods. Without a functional 1.6 CNI provider until very late in
|
||||
the development cycle, Conformance tests were disabled for kubeadm’s
|
||||
end-to-end jobs in favor of only testing initialization and node joining.
|
||||
|
||||
- The kubeadm end-to-end tests were also constantly breaking throughout the
|
||||
development cycle due to upstream kubeadm and test-infra changes. There was
|
||||
no automated monitoring to notify the SIG of failures, nor was there a
|
||||
process defined for fixing them, which led to a single person manually
|
||||
watching them and addressing failures as they occurred. As a result, manual
|
||||
testing was being used near the release milestone, and was successfully
|
||||
passing through the 1.6.0-beta.4 release. Without coordination with the
|
||||
Release Czar, 1.6.0-rc.1 and 1.6.0 were released without manual end-to-end
|
||||
testing of kubeadm and contained the regression.
|
||||
|
||||
- There was a lot of rushing to get the release ready before KubeCon EU, causing
|
||||
a shortened timeframe for RC and release, lowered bandwidth for SIG members,
|
||||
and the last SIG meeting before the release to be cancelled, which decreased
|
||||
communication.
|
||||
|
||||
- There was no explicit release-readiness sign-off by the SIG. The SIG had
|
||||
checklists for bringing kubeadm to Beta (the goal for 1.6.0), and they
|
||||
included end-to-end tests which were knowingly in a bad state, but no one
|
||||
escalated to delay the release or to remove kubeadm’s Beta status.
|
||||
|
||||
- After the 1.6.0 bug was discovered, there was no public announcement to let
|
||||
users know about the flaw or the timeline to expect a fix.
|
||||
|
||||
- There were two GitHub issues
|
||||
([kubeadm\#212](https://github.com/kubernetes/kubeadm/issues/212) and
|
||||
[kubernetes\#43815](https://github.com/kubernetes/kubernetes/issues/43815))
|
||||
both tracking the bug. Both were flooded by users duplicating the bug report
|
||||
or their workarounds, with splintered developer discussions for the short-term
|
||||
and long-term fixes, which made the issue noisy for anyone who just wanted
|
||||
updates on the status of the official fix. Additional communication occurred
|
||||
on Slack channels, so there was no single authoritative source to follow for
|
||||
updates.
|
||||
|
||||
- Older versions of the kubeadm Debian packages were removed when 1.6.0 was
|
||||
released, so users could not fall back on older versions of kubeadm unless
|
||||
they had cached versions. This was intentional for this release (since prior
|
||||
versions were Alpha and insecure), and shouldn’t happen in future releases,
|
||||
but it left some users out of luck who were knowingly depending on kubeadm 1.5
|
||||
or wanted to fall back after 1.6.0 failed for them.
|
||||
|
||||
**Where we got lucky**
|
||||
|
||||
- This bug only manifested during cluster initialization, and occurred
|
||||
consistently. This meant that it was detected very quickly, was trivial to
|
||||
reproduce, and had minimal impact on customers since they could not have been
|
||||
relying on the cluster yet. If the bug had been more subtle, it could have
|
||||
been triggered at random points during the lifecycle of a cluster, been more
|
||||
difficult to reproduce and fix, and caused harm to clusters that were already
|
||||
in use by customers.
|
||||
|
||||
- Even without full testing, there were no other kubeadm regressions between
|
||||
1.6.0-rc.1 and 1.6.0.
|
||||
|
||||
**Action Items:**
|
||||
|
||||
| **Item** | **Type** | **Owner** | **Issue** |
|
||||
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|-----------|--------------------------------------------------------------------------|
|
||||
| Add end-to-end kubeadm postsubmit tests for release-1.6 branch | detect | pipejakob | DONE |
|
||||
| Add end-to-end kubeadm presubmit tests (non-blocking) | prevent | | [kubeadm\#250](https://github.com/kubernetes/kubeadm/issues/250) |
|
||||
| Add end-to-end kubeadm variants that use non-third-party CNI providers, like “bridge” | prevent | | [kubeadm\#218](https://github.com/kubernetes/kubeadm/issues/218) |
|
||||
| Notify SIG on kubeadm postsubmit end-to-end test failures | detect | | [test-infra\#2555](https://github.com/kubernetes/test-infra/issues/2555) |
|
||||
| Define process of who should triage and/or fix kubeadm end-to-end test failures, and how | prevent | | [kubeadm\#251](https://github.com/kubernetes/kubeadm/issues/251) |
|
||||
| Do not remove old versions from distribution repositories during release | mitigate | | [kubeadm\#252](https://github.com/kubernetes/kubeadm/issues/252) |
|
||||
| Define kubeadm release process that blocks future releases on its completion (e.g. setup end-to-end tests for new release branch, when and how to make the go/no-go decision) | prevent | | [kubeadm\#252](https://github.com/kubernetes/kubeadm/issues/252) |
|
||||
| Document incident response process for critically flawed Kubernetes releases, including how to notify the community and track progress to conclusion | mitigate | | [community\#564](https://github.com/kubernetes/community/issues/564) |
|
||||
|
||||
**Timeline**
|
||||
|
||||
All times are in 24-hour PST8PDT.
|
||||
|
||||
**2017/02/24**
|
||||
|
||||
> 06:00 fejta changes e2e-runner.sh
|
||||
> ([test-infra\#1657](https://github.com/kubernetes/test-infra/pull/1657)),
|
||||
> inadvertently regresses kubeadm e2e test
|
||||
|
||||
**2017/03/08**
|
||||
|
||||
> 13:44 pipejakob fixes regression
|
||||
> ([test-infra\#2179](https://github.com/kubernetes/test-infra/pull/2179)), but
|
||||
> the e2e test is still failing because of recent kubeadm CLI changes
|
||||
|
||||
**2017/03/08**
|
||||
|
||||
> 13:22 spxtr refactors prow config
|
||||
> ([test-infra\#2192](https://github.com/kubernetes/test-infra/pull/2192)),
|
||||
> which later breaks kubeadm e2e job configuration when it gets pushed (this
|
||||
> timestamp is for the merge, but actual activation of config is unknown since
|
||||
> it is done manually)
|
||||
|
||||
**2017/03/09**
|
||||
|
||||
> 21:43 pipejakob merges commit to accommodate recent kubeadm CLI changes to
|
||||
> attempt to fix e2e jobs
|
||||
> ([kubernetes-anywhere\#352](https://github.com/kubernetes/kubernetes-anywhere/pull/352))
|
||||
|
||||
**2017/03/13**
|
||||
|
||||
> 11:27 pipejakob temporarily disables kubeadm e2e Conformance testing
|
||||
> ([test-infra\#2184](https://github.com/kubernetes/test-infra/pull/2184)) to
|
||||
> get a better signal; test runs are back to green but only exercise
|
||||
> initializing the cluster and verifying that nodes join correctly
|
||||
>
|
||||
> 12:01 while still trying to fix CNI providers on kubeadm e2e test, pipejakob
|
||||
> finds that even after accounting for expected changes (master taint renaming,
|
||||
> RBAC being enabled, unauthenticated access being turned off), CNI providers
|
||||
> still aren’t working
|
||||
> ([kubeadm\#190](https://github.com/kubernetes/kubeadm/issues/190#issuecomment-286209644))
|
||||
|
||||
**2017/03/14**
|
||||
|
||||
> 13:11 pipejakob fixes kubeadm e2e job configuration (which was pushed at some
|
||||
> point after spxtr’s prow configuration refactoring)
|
||||
> ([test-infra\#2246](https://github.com/kubernetes/test-infra/pull/2246))
|
||||
|
||||
**2017/03/16**
|
||||
|
||||
> 11:23 bboreham fixes the weave-net CNI provider
|
||||
> ([weave\#2850](https://github.com/weaveworks/weave/pull/2850)) to account for
|
||||
> “CNI unknown pod deletion” change
|
||||
>
|
||||
> 14:52 krzyzacy migrates kubeadm e2e job to be scenario/json based
|
||||
> ([test-infra\#2141](https://github.com/kubernetes/test-infra/pull/2141)),
|
||||
> which breaks the job.
|
||||
>
|
||||
> Over the next few days, krzyzacy tries to fix the above regression, but the
|
||||
> job is ultimately left failing because Conformance testing has been
|
||||
> erroneously re-enabled, which is known to be broken due to CNI issues
|
||||
> ([test-infra\#2280](https://github.com/kubernetes/test-infra/pull/2280),
|
||||
> [test-infra\#2284](https://github.com/kubernetes/test-infra/pull/2284),
|
||||
> [test-infra\#2285](https://github.com/kubernetes/test-infra/pull/2285),
|
||||
> [test-infra\#2286](https://github.com/kubernetes/test-infra/pull/2286),
|
||||
> [test-infra\#2288](https://github.com/kubernetes/test-infra/pull/2288))
|
||||
|
||||
**2017/03/17**
|
||||
|
||||
> 10:41 enisoc releases
|
||||
> [1.6.0-beta.4](https://github.com/kubernetes/kubernetes/commit/b202120be3a97e5f8a5e20da51d0b6f5a1eebd31).
|
||||
> Since e2e tests are broken, pipejakob manually tests cluster initialization
|
||||
> locally (kubeadm still works), as well as the updated weave-net manifest
|
||||
|
||||
**2017/03/22**
|
||||
|
||||
> 12:35 dcbw merges change to make kubelet report NotReady when CNI is
|
||||
> unconfigured
|
||||
> ([kubernetes\#43474](https://github.com/kubernetes/kubernetes/pull/43474)),
|
||||
> but e2e tests are already failing so no one notices the kubeadm regression
|
||||
|
||||
**2017/03/24**
|
||||
|
||||
> 12:06 enisoc releases
|
||||
> [1.6.0-rc.1](https://github.com/kubernetes/kubernetes/commit/8ea07d1fd277de8ab5ea7f281766760bcb7d0fe5).
|
||||
> This was the first release to regress kubeadm, but it goes untested.
|
||||
|
||||
**2017/03/28**
|
||||
|
||||
> 09:23 enisoc releases Kubernetes
|
||||
> [1.6.0](https://github.com/kubernetes/kubernetes/releases/tag/v1.6.0).
|
||||
>
|
||||
> 09:27 pipejakob updates kubeadm e2e job to use weave-net plugin so that
|
||||
> Conformance testing can be re-enabled
|
||||
> ([test-infra\#2347](https://github.com/kubernetes/test-infra/pull/2347)), but
|
||||
> due to the subtle [gcloud ssh
|
||||
> bug](https://github.com/kubernetes/kubeadm/issues/219), the job is still
|
||||
> broken after the update, so it masks the new regression in kubeadm init
|
||||
>
|
||||
> 22:40 jimmycuadra reports kubeadm 1.6.0 being broken ([kubeadm issue
|
||||
> 212](https://github.com/kubernetes/kubeadm/issues/212))
|
||||
|
||||
**2017/03/29**
|
||||
|
||||
> 13:04 kensimon opens PR to fix kubeadm master: “Tolerate node network not
|
||||
> being ready“
|
||||
> ([kubernetes\#43824](https://github.com/kubernetes/kubernetes/pull/43824))
|
||||
>
|
||||
> 18:29 mikedanese opens second PR to fix kubeadm master in different way:
|
||||
> “don't wait for first kubelet to be ready and drop dummy deploy.”
|
||||
> ([kubernetes\#43835](https://github.com/kubernetes/kubernetes/pull/43835))
|
||||
> pipejakob helps manually test it for QA purposes.
|
||||
>
|
||||
> 18:51 mikedanese opens PR for cherry-pick of above fix to release-1.6 branch
|
||||
> ([kubernetes\#43837](https://github.com/kubernetes/kubernetes/pull/43837))
|
||||
|
||||
**2017/03/30**
|
||||
|
||||
> 16:57 mikedanese’s kubeadm fix
|
||||
> ([kubernetes\#43835](https://github.com/kubernetes/kubernetes/pull/43835)) is
|
||||
> merged to master (kensimon’s gets discarded)
|
||||
>
|
||||
> 21:57 mikedanese adds new build of .deb to kubernetes-xenial-unstable channel
|
||||
> for users to test ([kubernetes issue
|
||||
> 43815](https://github.com/kubernetes/kubernetes/issues/43815#issuecomment-290616036))
|
||||
|
||||
**2017/03/31**
|
||||
|
||||
> 00:26 mikedanese’s cherry-pick merged to release-1.6 branch
|
||||
> ([kubernetes\#43837](https://github.com/kubernetes/kubernetes/pull/43837))
|
||||
>
|
||||
> 11:34 pipejakob merges CI job for release-1.6 branch
|
||||
> ([test-infra\#2352](https://github.com/kubernetes/test-infra/pull/2352))
|
||||
>
|
||||
> 16:30 pipejakob merges quick fix
|
||||
> ([test-infra\#2380](https://github.com/kubernetes/test-infra/pull/2380)) for
|
||||
> the “gcloud ssh issue,” which fixes Conformance testing
|
||||
|
||||
**2017/04/03**
|
||||
|
||||
> 13:32 enisoc releases Kubernetes
|
||||
> [1.6.1](https://github.com/kubernetes/kubernetes/releases/tag/v1.6.1)
|
Loading…
Reference in New Issue