From cdb06ee3f7a821178b2ac15f240cfbb39aed885a Mon Sep 17 00:00:00 2001
From: Jacob Beacham
Date: Fri, 21 Apr 2017 11:07:14 -0700
Subject: [PATCH] Adding kubeadm 1.6 postmortem.

This document was converted to markdown after review of the original Google doc:
https://docs.google.com/document/d/1o-wJ39O8eqBxOCHgu0MGmgH4EdojzI3WlrsV3TVwhe0/edit?usp=sharing
---
 .../postmortems/kubeadm-1.6.md | 269 ++++++++++++++++++
 1 file changed, 269 insertions(+)
 create mode 100644 sig-cluster-lifecycle/postmortems/kubeadm-1.6.md

diff --git a/sig-cluster-lifecycle/postmortems/kubeadm-1.6.md b/sig-cluster-lifecycle/postmortems/kubeadm-1.6.md
new file mode 100644
index 000000000..fe828b549
--- /dev/null
+++ b/sig-cluster-lifecycle/postmortems/kubeadm-1.6.md
@@ -0,0 +1,269 @@

Kubernetes Postmortem: kubeadm 1.6.0 release
============================================

**Incident Date:** 2017-03-28

**Owners:** Jacob Beacham (@pipejakob)

**Collaborators:** Joe Beda (@jbeda), Mike Danese (@mikedanese), Robert Bailey
(@roberthbailey)

**Status:** \[draft | pending feedback | **final**\]

**Summary:** kubeadm 1.6.0 consistently hung while trying to initialize new
clusters. Fixing it required creating the 1.6.1 patch release six days after
1.6.0.

**Impact:** Initialization of any new 1.6.0 master using kubeadm; no new
cluster could be brought up with kubeadm 1.6.0.

**Root Causes:** kubelet’s behavior was changed to report NotReady instead of
Ready when CNI was unconfigured
([\#43474](https://github.com/kubernetes/kubernetes/pull/43474)). To validate
the control plane’s health, kubeadm waited for the master node to become Ready
(and then scheduled a dummy deployment) before any CNI provider was added. With
the new kubelet behavior, the node could not become Ready until CNI was
configured, and CNI could not be configured until initialization completed, so
kubeadm hung indefinitely.

**Resolution:** kubeadm initialization now only waits for the master node to
register with the API server; it no longer requires the node to be Ready and no
longer attempts a dummy deployment to validate the control plane’s health
([\#43835](https://github.com/kubernetes/kubernetes/pull/43835)). This behavior
is being revisited for the 1.7 release.

**Detection:** A customer filed an issue against kubeadm after trying to
initialize a new cluster with the 1.6.0 release
([\#212](https://github.com/kubernetes/kubeadm/issues/212)).

**Lessons Learned:**

**What went well**

- The bug was discovered quickly after the release.

- Once the bug was discovered, a solution was ready within a day, and the patch
  release was available five days after that (on a Monday, so the weekend
  accounted for some of the gap).

**What went wrong**

- The 1.6.0 release of kubeadm never passed end-to-end testing.

  - End-to-end tests only existed against the master branch, not the
    release-1.6 branch.

  - Conformance testing of kubeadm requires a functioning CNI provider, but due
    to changes in 1.6.0 clusters and CNI itself, previously kubeadm-endorsed
    CNI providers had to be updated to handle the new master taint, RBAC being
    enabled, the master’s insecure port being disabled, and the deletion of
    unknown pods. Because no CNI provider worked with 1.6 until very late in
    the development cycle, Conformance tests were disabled for kubeadm’s
    end-to-end jobs in favor of only testing initialization and node joining.

  - The kubeadm end-to-end tests were also constantly breaking throughout the
    development cycle due to upstream kubeadm and test-infra changes.
    There was no automated monitoring to notify the SIG of failures, nor a
    defined process for fixing them, so a single person ended up manually
    watching the jobs and addressing failures as they occurred. As a result,
    manual testing was relied on near the release milestone, and it passed
    through the 1.6.0-beta.4 release. Because this was not coordinated with
    the Release Czar, 1.6.0-rc.1 and 1.6.0 were released without manual
    end-to-end testing of kubeadm and contained the regression.

- The rush to get the release ready before KubeCon EU shortened the timeframe
  between RC and release, reduced SIG members’ bandwidth, and led to the
  cancellation of the last SIG meeting before the release, all of which
  decreased communication.

- There was no explicit release-readiness sign-off by the SIG. The SIG had
  checklists for bringing kubeadm to Beta (the goal for 1.6.0), and they
  included end-to-end tests that were known to be in a bad state, but no one
  escalated to delay the release or to remove kubeadm’s Beta status.

- After the 1.6.0 bug was discovered, there was no public announcement to let
  users know about the flaw or when to expect a fix.

- Two GitHub issues
  ([kubeadm\#212](https://github.com/kubernetes/kubeadm/issues/212) and
  [kubernetes\#43815](https://github.com/kubernetes/kubernetes/issues/43815))
  tracked the same bug. Both were flooded with duplicate bug reports and user
  workarounds, and developer discussion of the short-term and long-term fixes
  was splintered across them, which made the issues noisy for anyone who just
  wanted updates on the status of the official fix. Additional communication
  occurred on Slack channels, so there was no single authoritative source to
  follow for updates.

- Older versions of the kubeadm Debian packages were removed when 1.6.0 was
  released, so users could not fall back to older versions of kubeadm unless
  they had cached copies. This was intentional for this release (since prior
  versions were Alpha and insecure) and shouldn’t happen in future releases,
  but it left out of luck those users who were deliberately depending on
  kubeadm 1.5 or who wanted to fall back after 1.6.0 failed for them.

**Where we got lucky**

- This bug only manifested during cluster initialization, and occurred
  consistently. This meant that it was detected very quickly, was trivial to
  reproduce, and had minimal impact on customers, since they could not have
  been relying on the cluster yet. If the bug had been more subtle, it could
  have been triggered at random points during the lifecycle of a cluster, been
  more difficult to reproduce and fix, and caused harm to clusters that were
  already in use by customers.

- Even without full testing, there were no other kubeadm regressions between
  1.6.0-rc.1 and 1.6.0.

**Action Items:**

| **Item** | **Type** | **Owner** | **Issue** |
|----------|----------|-----------|-----------|
| Add end-to-end kubeadm postsubmit tests for the release-1.6 branch | detect | pipejakob | DONE |
| Add end-to-end kubeadm presubmit tests (non-blocking) | prevent | | [kubeadm\#250](https://github.com/kubernetes/kubeadm/issues/250) |
| Add end-to-end kubeadm variants that use non-third-party CNI providers, like “bridge” | prevent | | [kubeadm\#218](https://github.com/kubernetes/kubeadm/issues/218) |
| Notify SIG on kubeadm postsubmit end-to-end test failures | detect | | [test-infra\#2555](https://github.com/kubernetes/test-infra/issues/2555) |
| Define a process for who should triage and/or fix kubeadm end-to-end test failures, and how | prevent | | [kubeadm\#251](https://github.com/kubernetes/kubeadm/issues/251) |
| Do not remove old versions from distribution repositories during release | mitigate | | [kubeadm\#252](https://github.com/kubernetes/kubeadm/issues/252) |
| Define a kubeadm release process that blocks future releases on its completion (e.g. set up end-to-end tests for the new release branch, when and how to make the go/no-go decision) | prevent | | [kubeadm\#252](https://github.com/kubernetes/kubeadm/issues/252) |
| Document an incident response process for critically flawed Kubernetes releases, including how to notify the community and track progress to conclusion | mitigate | | [community\#564](https://github.com/kubernetes/community/issues/564) |

**Timeline**

All times are in 24-hour PST8PDT.
+ +**2017/02/24** + +> 06:00 fejta changes e2e-runner.sh +> ([test-infra\#1657](https://github.com/kubernetes/test-infra/pull/1657)), +> inadvertently regresses kubeadm e2e test + +**2017/03/08** + +> 13:44 pipejakob fixes regression +> ([test-infra\#2179](https://github.com/kubernetes/test-infra/pull/2179)), but +> the e2e test is still failing because of recent kubeadm CLI changes + +**2017/03/08** + +> 13:22 spxtr refactors prow config +> ([test-infra\#2192](https://github.com/kubernetes/test-infra/pull/2192)), +> which later breaks kubeadm e2e job configuration when it gets pushed (this +> timestamp is for the merge, but actual activation of config is unknown since +> it is done manually) + +**2017/03/09** + +> 21:43 pipejakob merges commit to accommodate recent kubeadm CLI changes to +> attempt to fix e2e jobs +> ([kubernetes-anywhere\#352](https://github.com/kubernetes/kubernetes-anywhere/pull/352)) + +**2017/03/13** + +> 11:27 pipejakob temporarily disables kubeadm e2e Conformance testing +> ([test-infra\#2184](https://github.com/kubernetes/test-infra/pull/2184)) to +> get a better signal; test runs are back to green but only exercise +> initializing the cluster and verifying that nodes join correctly +> +> 12:01 while still trying to fix CNI providers on kubeadm e2e test, pipejakob +> finds that even after accounting for expected changes (master taint renaming, +> RBAC being enabled, unauthenticated access being turned off), CNI providers +> still aren’t working +> ([kubeadm\#190](https://github.com/kubernetes/kubeadm/issues/190#issuecomment-286209644)) + +**2017/03/14** + +> 13:11 pipejakob fixes kubeadm e2e job configuration (which was pushed at some +> point after spxtr’s prow configuration refactoring) +> ([test-infra\#2246](https://github.com/kubernetes/test-infra/pull/2246)) + +**2017/03/16** + +> 11:23 bboreham fixes the weave-net CNI provider +> ([weave\#2850](https://github.com/weaveworks/weave/pull/2850)) to account for +> “CNI unknown pod deletion” change +> +> 14:52 krzyzacy migrates kubeadm e2e job to be scenario/json based +> ([test-infra\#2141](https://github.com/kubernetes/test-infra/pull/2141)), +> which breaks the job. +> +> Over the next few days, krzyzacy tries to fix the above regression, but the +> job is ultimately left failing because Conformance testing has been +> erroneously re-enabled, which is known to be broken due to CNI issues +> ([test-infra\#2280](https://github.com/kubernetes/test-infra/pull/2280), +> [test-infra\#2284](https://github.com/kubernetes/test-infra/pull/2284), +> [test-infra\#2285](https://github.com/kubernetes/test-infra/pull/2285), +> [test-infra\#2286](https://github.com/kubernetes/test-infra/pull/2286), +> [test-infra\#2288](https://github.com/kubernetes/test-infra/pull/2288)) + +**2017/03/17** + +> 10:41 enisoc releases +> [1.6.0-beta.4](https://github.com/kubernetes/kubernetes/commit/b202120be3a97e5f8a5e20da51d0b6f5a1eebd31). +> Since e2e tests are broken, pipejakob manually tests cluster initialization +> locally (kubeadm still works), as well as the updated weave-net manifest + +**2017/03/22** + +> 12:35 dcbw merges change to make kubelet report NotReady when CNI is +> unconfigured +> ([kubernetes\#43474](https://github.com/kubernetes/kubernetes/pull/43474)), +> but e2e tests are already failing so no one notices the kubeadm regression + +**2017/03/24** + +> 12:06 enisoc releases +> [1.6.0-rc.1](https://github.com/kubernetes/kubernetes/commit/8ea07d1fd277de8ab5ea7f281766760bcb7d0fe5). 
> This was the first release to regress kubeadm, but it goes untested.

**2017/03/28**

> 09:23 enisoc releases Kubernetes
> [1.6.0](https://github.com/kubernetes/kubernetes/releases/tag/v1.6.0).
>
> 09:27 pipejakob updates the kubeadm e2e job to use the weave-net plugin so
> that Conformance testing can be re-enabled
> ([test-infra\#2347](https://github.com/kubernetes/test-infra/pull/2347)), but
> due to the subtle [gcloud ssh
> bug](https://github.com/kubernetes/kubeadm/issues/219), the job is still
> broken after the update, so it masks the new regression in kubeadm init
>
> 22:40 jimmycuadra reports kubeadm 1.6.0 as broken ([kubeadm issue
> 212](https://github.com/kubernetes/kubeadm/issues/212))

**2017/03/29**

> 13:04 kensimon opens a PR to fix kubeadm master: “Tolerate node network not
> being ready”
> ([kubernetes\#43824](https://github.com/kubernetes/kubernetes/pull/43824))
>
> 18:29 mikedanese opens a second PR to fix kubeadm master in a different way:
> “don't wait for first kubelet to be ready and drop dummy deploy.”
> ([kubernetes\#43835](https://github.com/kubernetes/kubernetes/pull/43835));
> pipejakob helps test it manually for QA purposes
>
> 18:51 mikedanese opens a PR to cherry-pick the above fix to the release-1.6
> branch
> ([kubernetes\#43837](https://github.com/kubernetes/kubernetes/pull/43837))

**2017/03/30**

> 16:57 mikedanese’s kubeadm fix
> ([kubernetes\#43835](https://github.com/kubernetes/kubernetes/pull/43835)) is
> merged to master (kensimon’s is discarded)
>
> 21:57 mikedanese adds a new .deb build to the kubernetes-xenial-unstable
> channel for users to test ([kubernetes issue
> 43815](https://github.com/kubernetes/kubernetes/issues/43815#issuecomment-290616036))

**2017/03/31**

> 00:26 mikedanese’s cherry-pick is merged to the release-1.6 branch
> ([kubernetes\#43837](https://github.com/kubernetes/kubernetes/pull/43837))
>
> 11:34 pipejakob merges the CI job for the release-1.6 branch
> ([test-infra\#2352](https://github.com/kubernetes/test-infra/pull/2352))
>
> 16:30 pipejakob merges a quick fix
> ([test-infra\#2380](https://github.com/kubernetes/test-infra/pull/2380)) for
> the “gcloud ssh issue,” which fixes Conformance testing

**2017/04/03**

> 13:32 enisoc releases Kubernetes
> [1.6.1](https://github.com/kubernetes/kubernetes/releases/tag/v1.6.1)
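
**Appendix: illustrating the initialization deadlock**

The sketch below is a minimal, self-contained Go illustration of the deadlock
described under Root Causes. It is not kubeadm’s actual code: the `Node` type
and the `nodeIsReady` and `waitForMasterReady` functions are invented for this
example, and the real kubeadm 1.6.0 loop polled the API server with no timeout,
so it hung forever rather than returning an error.

```go
// Illustration only: a simplified model of the kubeadm 1.6.0 hang, not the
// real kubeadm or kubelet code.
package main

import (
	"errors"
	"fmt"
	"time"
)

// Node models the one piece of state that matters here: whether a CNI
// provider has been configured on the master.
type Node struct {
	CNIConfigured bool
}

// nodeIsReady models kubelet after kubernetes#43474: the node reports Ready
// only once CNI is configured.
func nodeIsReady(n *Node) bool {
	return n.CNIConfigured
}

// waitForMasterReady models kubeadm 1.6.0 init, which polled until the master
// node reported Ready before proceeding (and before any CNI provider could be
// applied). The real loop had no timeout; one is added here only so this
// example terminates.
func waitForMasterReady(n *Node, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if nodeIsReady(n) {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return errors.New("master node never became Ready")
}

func main() {
	// During `kubeadm init`, no CNI provider has been applied yet; the user is
	// told to apply one only after init succeeds.
	master := &Node{CNIConfigured: false}

	if err := waitForMasterReady(master, 2*time.Second); err != nil {
		// With a 1.6.0 kubelet this branch is inevitable: Ready requires CNI,
		// and CNI requires init to finish first.
		fmt.Println("deadlock:", err)
	}
}
```

Running the sketch prints the error once the illustrative timeout expires; with
a real 1.6.0 kubelet, the Ready condition could never be satisfied until a CNI
provider was applied, which in turn could not happen until `kubeadm init`
completed.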