Merge pull request #1707 from fabriziopandini/kubeadm-join--master

KEP kubeadm join --master workflow
2018-04-02 13:34:15 -07:00 · 2018-04-02 13:34:15 -07:00 · fd283d1b24
parent 70f2f2475d a467a79f07
commit fd283d1b24
1 changed files with 448 additions and 0 deletions
--- a/keps/sig-cluster-lifecycle/draft-20180130-kubeadm-join-master.md
+++ b/keps/sig-cluster-lifecycle/draft-20180130-kubeadm-join-master.md
@ -0,0 +1,448 @@
+# kubeadm join --master workflow
+
+## Metadata
+
+```yaml
+---
+kep-number: draft-20180130
+title: kubeadm join --master workflow
+status: accepted
+authors:
+  - "@fabriziopandini"
+owning-sig: sig-cluster-lifecycle
+reviewers:
+  - "@errordeveloper"
+  - "@jamiehannaford"
+approvers:
+  - "@luxas"
+  - "@timothysc"
+  - "@roberthbailey"
+editor:
+  - "@fabriziopandini"
+creation-date: 2018-01-28
+last-updated: 2018-01-28
+see-also:
+  - KEP 0004
+
+```
+
+## Table of Contents
+
+  * [kubeadm join --master workflow](#kubeadm-join---master-workflow)
+    * [Metadata](#metadata)
+    * [Table of Contents](#table-of-contents)
+    * [Summary](#summary)
+    * [Motivation](#motivation)
+      * [Goals](#goals)
+      * [Non-goals](#non-goals)
+      * [Challenges and Open Questions](#challenges-and-open-questions)
+    * [Proposal](#proposal)
+      * [User Stories](#user-stories)
+        * [Add a new master node](#add-a-new-master-node)
+      * [Implementation Details](#implementation-details)
+        * [advertise-address = IP/DNS of the external load balancer](#advertise-address--ipdns-of-the-external-load-balancer)
+        * [kubeadm init --feature-gates=HighAvailability=true](#kubeadm-init---feature-gateshighavailabilitytrue)
+        * [kubeadm join --master workflow](#kubeadm-join---master-workflow-1)
+        * [Strategies for deploying control plane components](#strategies-for-deploying-control-plane-components)
+        * [Strategies for distributing cluster certificates](#strategies-for-distributing-cluster-certificates)
+    * [Graduation Criteria](#graduation-criteria)
+    * [Implementation History](#implementation-history)
+    * [Drawbacks](#drawbacks)
+    * [Alternatives](#alternatives)
+
+## Summary
+
+We are extending the kubeadm distinctive `init` and `join` workflow, introducing the
+capability to add more than one master node to an existing cluster by means of the
+new `kubeadm join --master` option.
+
+As a consequence, kubeadm will provide a best-practice, “fast path” for creating a
+minimum viable, conformant Kubernetes cluster with one or more master nodes and
+zero or more worker nodes.
+
+## Motivation
+
+Support for high availability is one of the most requested features for kubeadm.
+
+Even if, as of today, there is already the possibility to create an HA cluster
+using kubeadm in combination with some scripts and/or automation tools (e.g.
+[this](https://kubernetes.io/docs/setup/independent/high-availability/)), this KEP was
+designed with the objective to introduce an upstream simple and reliable solution for
+achieving the same goal.
+
+### Goals
+
+* "Divide and conquer”
+
+  This proposal - at least in its initial release - does not address all the possible
+  user stories for creating an highly available Kubernetes cluster, but instead
+  focuses on:
+
+  * Defining a generic and extensible flow for bootstrapping an HA cluster, the
+    `kubeadm join --master` workflow.
+  * Providing a solution *only* for one, well defined user story. see
+    [User Stories](#user-stories) and [Non-goals](#non-goals).
+
+* Provide support for a dynamic bootstrap flow
+
+  At the time a user is running `kubeadm init`, s/he/an operator might not know what
+  the cluster setup will look like eventually. For instance, the user may start with
+  only one master + n nodes,  and then add further master nodes with `kubeadm join --master`
+  or add more worker nodes with `kubeadm join` (in any order).
+
+* Enable higher-level tools integration
+
+  We expect higher-level and more tailored tooling to be built on top of kubeadm,
+  and ideally, using kubeadm as the basis of all deployments will make it easier
+  to create conformant cluster.
+
+  Accordingly, the `kubeadm join --master` workflow should provide support for
+  the following operational practices used by higher level tools:
+
+  * Parallel node creation
+
+    Higher-level tools could create nodes in parallel (both masters and workers)
+    for reducing the overall cluster startup time.
+    `kubeadm join --master` should support natively this practice without requiring
+    the implementation of any synchronization mechanics by higher-level tools.
+
+  * Replace reconciliation strategies
+
+    Especially in case of cloud deployments, higher-level automation tools could
+    decide for any reason to replace existing nodes with new ones (instead of apply
+    changes in-place to existing nodes).
+    `kubeadm join --master` will support this practice by making easier to replace
+    existing master nodes with new ones.
+
+### Non-goals
+
+* By design, kubeadm cares only about bootstrapping, not about provisioning machines.
+  Likewise, installing various nice-to-have addons, like the Kubernetes Dashboard,
+  monitoring solutions, and cloud-specific addons, is not in scope.
+
+* This proposal doesn't include a solution for etcd cluster management\*.
+
+* Nothing in this proposal should prevent users to run master nodes components
+  and etcd on the same machines; however users should be aware that this will
+  introduce limitations for strategies like parallel node creation and In-place
+  vs. Replace reconciliation.
+
+* Nothing in this proposal should prevent kubeadm to implement in future a
+  solution for provisioning an etcd cluster based on static pods/pods.
+
+* This proposal doesn't include a solution for API server load balancing.
+
+* Nothing in this proposal should prevent users from choosing their preferred
+  solution for API server load balancing.
+
+* Nothing in this proposal should prevent practices that exist today.
+
+* Nothing in this proposal should prevent user from pre-provision TLS assets
+  before running `kubeadm init` or `kubeadm join --master`.
+
+\* At the time of writing, the CoreOS recommended approach for etcd is to run 
+the etcd cluster outside kubernetes (see discussion in [kubeadm office hours](https://goo.gl/fjyeqo)).
+
+### Challenges and Open Questions
+
+* Keep the UX simple.
+
+  * _What are the acceptable trade-offs between the need to have a clean and simple
+    UX and the complexity of the following challenges and open questions?_
+
+* Create a cluster without knowing its final layout
+
+  Supporting a dynamic workflow implies that some information about the cluster are
+  not available at init time, like e.g. the number of master nodes, the ip of
+  master nodes etc. etc.
+
+  * _How to configure a Kubernetes cluster in order to easily adapt to future change
+    of its own layout like e.g. add a master node, remove a master node?_
+
+  * _What are the "pivotal" cluster settings that must be defined before initialising
+    the cluster?_
+
+  * _What are the mandatory conditions to be verified when executing `kubeadm init`
+    to allow/not allow the execution of `kubeadm join --master` in future?_
+
+* Kubeadm limited scope of action
+
+  * Kubeadm binary can execute actions _only_ on the machine where it is running
+    e.g. it is not possible to execute actions on other nodes, to copy files across
+    nodes etc.
+  * During the join workflow, kubeadm can access the cluster _only_ using identities
+    with limited grants, `system:unauthenticated` or `system:node-bootstrapper`.
+
+* Dependencies graduation
+
+  The solution for `kubeadm join --master` will rely on a set of dependencies/other
+  features which are still in the process for graduating to GA like e.g. dynamic
+  kubelet configuration, self-hosting, component config.
+
+  * _When `kubeadm join --master` should rely entirely on dependencies/features
+    still under graduation vs provide compatibility with older/less convenient but
+    more consolidated approaches?_
+
+  * _Should we support  `kubeadm join --master` for cluster with a control plane
+    deployed as static pods? What about cluster with a self-hosted
+    control plane?_
+  * _Should we support `kubeadm join --master` only  for cluster storing
+    cluster-certificates on file system? What about cluster
+    storing certificates in secrets?_
+
+* Upgradability
+
+  * How to setup an high available cluster in order to simplify the execution
+    of cluster version upgrades, both manually or with the support of `kubeadm upgrade`?_
+
+## Proposal
+
+### User Stories
+
+#### Add a new master node
+
+As a kubernetes administrator, I want to run  `kubeadm join --master`  for adding
+a new master node* to an existing Kubernetes cluster**, so that the cluster become
+more resilient to failures of the existing master nodes (high availability).
+
+\* A new "master node" is a new  kubernetes node with
+`node-role.kubernetes.io/master=""` label and
+`node-role.kubernetes.io/master:NoSchedule` taint; a new instance of control plane
+components will be deployed on the new master node
+
+> NB. In this first release of the proposal creating a new master node doesn't
+trigger the creation of a new etcd member on the same machine.
+
+\*\* In this first release of the proposal, `kubeadm join --master` could be
+executed _only_ on Kubernetes cluster compliant with following conditions:
+
+* The cluster was initialized with `kubeadm init`.
+* The cluster was initialized with `--feature-gates=HighAvailability=true`.
+* The cluster uses an external etcd.
+* An external load balancer was provisioned and the IP/DNS of the external
+  load balancer is used as advertise-address for the kube-api server.
+
+### Implementation Details
+
+#### advertise-address = IP/DNS of the external load balancer
+
+There are many ways to configure an highly available cluster.
+
+After prototyping and various discussions in
+[kubeadm office hours](https://youtu.be/HcvVi8O_ZGY), it was agreed to implement
+the approach that sets the `--advertise-address` equal to the IP/DNS of the
+external load balancer, without assigning dedicated `--advertise-address` IPs 
+for each master nodes.
+
+By excluding the IP of master nodes, kubeadm can create a unique API server
+serving certificate, and share this certificate across many masters nodes;
+no changes will be required to this certificate when adding/removing master nodes.
+
+Such properties make this approach best suited for the initial set up of
+the desired `kubeadm join --master` dynamic workflow.
+
+> Please note that in this scenario the kubernetes service will always resolve
+to the IP/DNS of the external load balancer, instead of resolving to the list
+of master IPs, but this fact was considered an acceptable trade-off at this stage.
+
+> It is expected to add support also for different HA configurations in future releases
+of this KEP.
+
+#### kubeadm init --feature-gates=HighAvailability=true
+
+When executing  `kubeadm join --master`, due to current kubeadm limitations, only
+few information about the cluster/about other master nodes are available.
+
+As a consequence, this proposal delegates to the initial `kubeadm init` - when
+executed with `--feature-gates=HighAvailability=true` - all the controls about
+the compliance of the cluster with the supported user story:
+
+* The cluster uses an external etcd.
+* An external load balancer is provisioned and the IP/DNS of the external load balancer is used as advertise-address.
+
+#### kubeadm join --master workflow
+
+The `kubeadm join --master` target workflow is an extension of the
+existing `kubeadm join` flow:
+
+1. Discovery cluster info [No changes to this step]
+
+   Access the `cluster-info` configMap in `kube-public` namespace (or read
+   the same information provided in a file).
+
+   > This step waits for a first instance of the kube-apiserver to become ready;
+   such wait cycle acts as embedded mechanism for handling the sequence
+   `kubeadm init` and `kubeadm join` in case of parallel node creation.
+
+2. In case of `join --master` [New step]
+
+   1. Using the bootstrap token as identity, read the `kubeadm-config` configMap
+      in `kube-system` namespace.
+
+      > This requires to grant access to the above configMap for
+      `system:node-bootstrapper` group (or to provide the same information
+      provided in a file like in 1.).
+
+   2. Check if the cluster is ready for joining a new master node:
+
+      a. Check if the cluster was created with  `--feature-gates=HighAvailability=true`.
+
+      > We assume that all the necessary conditions where already checked
+      during `kubeadm init`:
+      > * The cluster uses an external etcd.
+      > * An external load balancer is provisioned and the IP/DNS of the external
+    load balancer is used as advertise-address.
+
+      b. In case of cluster certificates stored on file system, check if the
+      expected certificates exists.
+
+      > see "Strategies for distributing cluster certificates" paragraph for
+      additional info about this step.
+
+   3. Prepare the node for joining as a master node:
+
+      a. In case of control plane deployed as static pods, create kubeconfig files
+      and static pod manifests for control plane components.
+
+      >  see "Strategies for deploying control plane components" paragraph
+      for additional info about this step.
+
+   4. Create the admin.conf kubeconfig file
+
+3. Executes the TLS bootstrap process, including [No changes to this step]:
+
+   1. Start kubelet using the bootstrap token as identity
+   2. Request a certificate for the node - with the node identity - and retrieves
+      it after it is automatically approved
+   3. Restart kubelet with the node identity
+   4. Eventually, apply the kubelet dynamic configuration
+
+4. In case of `join --master` [New step]
+
+   1. Apply master taint and label to the node.
+
+      > This action is executed using the admin.conf identity created above;
+      >
+      > This action triggers the deployment of master components in case of
+        self-hosted control plane
+
+#### Strategies for deploying control plane components
+
+As of today kubeadm supports two solutions for deploying control plane components:
+
+1. Control plane deployed as static pods (current kubeadm default)
+2. Self-hosted control plane in case of `--feature-gates=SelfHosting=true`
+
+"Self-hosted control plane" is a solution that we expect - *in the long term* -
+will become mainstream, because it simplifies both deployment and upgrade of control
+plane components due to the fact that Kubernetes itself will take care of deploying
+corresponding pods on nodes.
+
+Unfortunately, at the time of writing it is unknown when this feature will graduate
+to beta/GA or when this feature will become the new kubeadm default; as a consequence,
+this proposal assumes that is still required to provide a solution both for case 1.
+and case 2.
+
+The proposed solution for case 1. "Control plane deployed as static pods", assumes
+that the `kubeadm join --master` flow will take care of creating required kubeconfig
+files and required static pod manifests.
+
+Case 2. "Self-hosted control plane," as described above, does not requires any
+additional steps to be implemented in the  `kubeadm join --master` flow.
+
+#### Strategies for distributing cluster certificates
+
+As of today kubeadm supports two solutions for storing cluster certificates:
+
+1. Cluster certificates stored on file system in case of:
+   * Control plane deployed as static pods (current kubeadm default)
+   * Self-hosted control plane in case of `--feature-gates=SelfHosting=true`
+2. Cluster certificates stored in secrets in case of:
+   * Self-hosted control plane + secrets in certs in case of
+     `--feature-gates=SelfHosting=true,StoreCertsInSecrets=true`
+
+"Storing cluster certificates in secrets" is a solution that we expect - *in the
+long term* - will become mainstream, because it simplifies certificates distribution
+and also certificate rotation, due to the fact that Kubernetes itself will take
+care of distributing certs on nodes.
+
+Unfortunately, at the time of writing it is unknown when this feature will graduate
+to beta/GA or when this feature will become the new kubeadm default; as a
+consequence, this proposal assumes it is required to provide a solution for both
+for case 1 and case 2.
+
+The proposed solution for case 1. "Cluster certificates stored on file system",
+requires the user/the higher level tools to execute an additional action _before_
+invoking `kubeadm join --master` (NB. kubeadm is limited  to execute actions *only*
+in the machine where it is running, so it is not possible to copy automatically
+certificates from remote locations).
+
+More specifically, in case of cluster with "cluster certificates stored on file
+system", before invoking `kubeadm join --master`, the user/higher level tools
+should copy control plane certificates from an existing node, e.g. the node
+where `kubeadm init` was run, to the joining node.
+
+Then, the  `kubeadm join --master` flow  will take care of checking certificates
+existence and conformance.
+
+Case 2. "Cluster certificates stored in secrets", as described above, does not
+requires any additional steps to be implemented in the  `kubeadm join --master`
+flow .
+
+## Graduation Criteria
+
+* To create a periodic E2E test that bootstraps an HA cluster with kubeadm
+  and exercise the dynamic bootstrap workflow
+* To ensure upgradability of HA clusters (possibly with another E2E test)
+* To document the kubeadm support for HA in kubernetes.io
+
+## Implementation History
+
+* original HA proposals [#1](https://goo.gl/QNtj5T) and [#2](https://goo.gl/C8V8PV)
+* merged [Kubeadm HA design doc](https://goo.gl/QpD5h8)
+* HA prototype [demo](https://goo.gl/2WLUUc) and [notes](https://goo.gl/NmTahy)
+* [PR #58261](https://github.com/kubernetes/kubernetes/pull/58261)
+
+## Drawbacks
+
+This proposal provides support for a single, well defined HA scenario.
+While this is considered a sustainable approach to the complexity of HA in Kubernetes,
+the limited scope of this proposal could be negatively perceived by final users.
+
+## Alternatives
+
+1)  Execute `kubeadm init` on many nodes
+
+The approach based on execution of `kubeadm init` on each master was considered as well,
+but not chosen because it seems to have several draw backs:
+
+*  There is no real control on parameters passed to `kubeadm init` executed on secondary masters,
+    and this can lead to unpredictable inconsistent configurations.
+*  The init sequence for secondary master won't go through the TLS boostrap process,
+    and this can be perceived security concern.
+*  The init sequence executes a lot of steps which are un-necessary on a secondary master;
+    now those steps are mostly idempotent, so basically now no harm is done by executing
+    them two or three times. Nevertheless to mantain this contract in future could be complex.
+
+2) Allow HA configurations with `--advertise-address` equal to the master ip address
+(and adding the IP/DNS of the external load balancer as an additional apiServerCertSANs).
+
+After some testing, this option was considered too complex/not
+adequate for the initial set up of the desired `kubeadm join --master` dynamic workflow;
+this can be better explained by looking at two implementation based on this option:
+
+* [kubernetes the hard way](https://github.com/kelseyhightower/kubernetes-the-hard-way)
+   uses the IP address of all master nodes for creating a new API server
+   serving certificate before bootstrapping the cluster, but this approach
+   can't be used if considering the desired dynamic workflow.
+
+* [Creating HA cluster with kubeadm](https://kubernetes.io/docs/setup/independent/high-availability/)
+  uses a different API server serving certificate for each master, and this
+  could increases the complexity of the first implementation because:
+  * the `kubeadm join --master` flow has to generate different certificates for
+    each master node.
+  * self-hosting control plane, should be adapted to mount different certificates
+    for each master.
+  * bootstrap check pointing should be designed to checkpoint a different
+    set of certificates for each master.
+  * upgrades should be adapted to consider master specific settings