# High Availability Considerations

This document contains a collection of community-provided considerations for setting up High Availability Kubernetes clusters. If something is incomplete or unclear, or if you have additional information, please feel free to contribute by creating a PR. A good place for asking questions or making remarks is the `#kubeadm` channel on the [Kubernetes slack](https://slack.k8s.io/), where most of the contributors are usually active.

- [High Availability Considerations](#high-availability-considerations)
  - [Overview](#overview)
  - [Options for Software Load Balancing](#options-for-software-load-balancing)
  - [keepalived and haproxy](#keepalived-and-haproxy)
  - [kube-vip](#kube-vip)
  - [Bootstrap the cluster](#bootstrap-the-cluster)

## Overview

When setting up a production cluster, high availability (the cluster's ability to remain operational even if some control plane or worker nodes fail) is usually a requirement. For worker nodes, assuming that there are enough of them, redundancy is part of the cluster's inherent functionality. However, redundancy of control plane nodes and `etcd` instances needs to be catered for when planning and setting up a cluster.

`kubeadm` supports setting up clusters with multiple control plane nodes and multiple `etcd` instances (see [Creating Highly Available clusters with kubeadm](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/) for a step-by-step guide). Still, there are some aspects to consider and set up which are not part of Kubernetes itself and hence not covered in the project documentation. This document provides some additional information and examples useful when planning and bootstrapping HA clusters with `kubeadm`.

## Options for Software Load Balancing

When setting up a cluster with more than one control plane node, higher availability can be achieved by putting the API Server instances behind a load balancer and pointing the new cluster at it via the `--control-plane-endpoint` option when running `kubeadm init`.

Of course, the load balancer itself should be highly available, too. This is usually achieved by adding redundancy to the load balancer. In order to do so, a cluster of hosts managing a virtual IP is set up, with each host running an instance of the load balancer, so that the load balancer on the host currently holding the vIP is always the one being used while the others are on standby.

In some environments, such as data centers with dedicated load balancing components (provided e.g. by some cloud providers), this functionality may already be available. If it is not, user-managed load balancing can be used. In that case, some preparation is necessary before bootstrapping a cluster; since this is not part of Kubernetes or `kubeadm`, it must be taken care of separately.

In the following sections, we give examples that have been working for some people, while of course there are potentially dozens of other possible configurations.

## keepalived and haproxy

For providing load balancing from a virtual IP, the combination of [keepalived](https://www.keepalived.org) and [haproxy](https://www.haproxy.com) has been around for a long time and can be considered well-known and well-tested:

- The `keepalived` service provides a virtual IP managed by a configurable health check. Due to the way the virtual IP is implemented, all the hosts between which the virtual IP is negotiated need to be in the same IP subnet.
- The `haproxy` service can be configured for simple stream-based load balancing, thus allowing TLS termination to be handled by the API Server instances behind it.

This combination can be run either as services on the operating system or as static pods on the control plane hosts. The service configuration is identical for both cases.

### keepalived configuration

The `keepalived` configuration consists of two files: the service configuration file and a health check script, which will be called periodically to verify that the node holding the virtual IP is still operational.

The files are assumed to reside in the `/etc/keepalived` directory. Note, however, that some Linux distributions may keep them elsewhere. The following configuration has been successfully used with `keepalived` versions 2.0.20 and 2.2.4:

```bash
! /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
    router_id LVS_DEVEL
}
vrrp_script check_apiserver {
  script "/etc/keepalived/check_apiserver.sh"
  interval 3
  weight -2
  fall 10
  rise 2
}

vrrp_instance VI_1 {
    state ${STATE}
    interface ${INTERFACE}
    virtual_router_id ${ROUTER_ID}
    priority ${PRIORITY}
    authentication {
        auth_type PASS
        auth_pass ${AUTH_PASS}
    }
    virtual_ipaddress {
        ${APISERVER_VIP}
    }
    track_script {
        check_apiserver
    }
}
```

There are some placeholders in `bash` variable style to fill in:

- `${STATE}` is `MASTER` for one and `BACKUP` for all other hosts, hence the virtual IP will initially be assigned to the `MASTER`.
- `${INTERFACE}` is the network interface taking part in the negotiation of the virtual IP, e.g. `eth0`.
- `${ROUTER_ID}` should be the same for all `keepalived` cluster hosts while unique amongst all clusters in the same subnet. Many distros pre-configure its value to `51`.
- `${PRIORITY}` should be higher on the host that initially holds the `MASTER` state than on the backups. Hence `101` and `100` respectively will suffice.
- `${AUTH_PASS}` should be the same for all `keepalived` cluster hosts, e.g. `42`.
- `${APISERVER_VIP}` is the virtual IP address negotiated between the `keepalived` cluster hosts.

The above `keepalived` configuration uses the health check script `/etc/keepalived/check_apiserver.sh`, which is responsible for making sure that the API Server is available on the node holding the virtual IP. This script could look like this:

```
#!/bin/sh

errorExit() {
    echo "*** $*" 1>&2
    exit 1
}

curl -sfk --max-time 2 https://localhost:${APISERVER_DEST_PORT}/healthz -o /dev/null || errorExit "Error GET https://localhost:${APISERVER_DEST_PORT}/healthz"
```

Fill in the placeholder `${APISERVER_DEST_PORT}` with the port through which Kubernetes will talk to the API Server. That is the port `haproxy` or your load balancer will be listening on.
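For a quick sanity check (the value `8443` below is only an illustrative replacement for `${APISERVER_DEST_PORT}`), the script can be run manually on a node before handing it over to `keepalived`:

```
# Manual test of the health check script with the placeholder already filled in.
# Exit code 0 means the API Server answered on https://localhost:8443/healthz;
# repeated failures make keepalived lower this node's priority (weight -2, fall 10 above).
sh /etc/keepalived/check_apiserver.sh && echo "API Server healthy"
```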
### haproxy configuration

The `haproxy` configuration consists of one file: the service configuration file, which is assumed to reside in the `/etc/haproxy` directory. Note, however, that some Linux distributions may keep it elsewhere. The following configuration has been successfully used with `haproxy` versions 2.4 and 2.8:

```bash
# /etc/haproxy/haproxy.cfg
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    log stdout format raw local0
    daemon

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 1
    timeout http-request    10s
    timeout queue           20s
    timeout connect         5s
    timeout client          35s
    timeout server          35s
    timeout http-keep-alive 10s
    timeout check           10s

#---------------------------------------------------------------------
# apiserver frontend which proxies to the control plane nodes
#---------------------------------------------------------------------
frontend apiserver
    bind *:${APISERVER_DEST_PORT}
    mode tcp
    option tcplog
    default_backend apiserverbackend

#---------------------------------------------------------------------
# round robin balancing for apiserver
#---------------------------------------------------------------------
backend apiserverbackend
    option httpchk

    http-check connect ssl
    http-check send meth GET uri /healthz
    http-check expect status 200

    mode tcp
    balance roundrobin

    server ${HOST1_ID} ${HOST1_ADDRESS}:${APISERVER_SRC_PORT} check verify none
    # [...]
```

Again, there are some placeholders in `bash` variable style to expand:

- `${APISERVER_DEST_PORT}` the port through which Kubernetes will talk to the API Server.
- `${APISERVER_SRC_PORT}` the port used by the API Server instances.
- `${HOST1_ID}` a symbolic name for the first load-balanced API Server host.
- `${HOST1_ADDRESS}` a resolvable address (DNS name, IP address) for the first load-balanced API Server host.
- additional `server` lines, one for each load-balanced API Server host.

### Option 1: Run the services on the operating system

In order to run the two services on the operating system, the respective distribution's package manager can be used to install the software. This can make sense if they will be running on dedicated hosts that are not part of the Kubernetes cluster.

Having installed the software and put the above configuration in place, the services can be enabled and started. On a recent RedHat-based system, `systemd` is used for this:

```
# systemctl enable haproxy --now
# systemctl enable keepalived --now
```

With the services up, the Kubernetes cluster can now be bootstrapped using `kubeadm init` (see [below](#bootstrap-the-cluster)).

### Option 2: Run the services as static pods

If `keepalived` and `haproxy` will be running on the control plane nodes, they can be configured to run as static pods. All that is necessary is to place the respective manifest files in the `/etc/kubernetes/manifests` directory before bootstrapping the cluster. During the bootstrap process, `kubelet` will bring the processes up, so that the cluster can use them while starting. This is an elegant solution, in particular with the setup described under [Stacked control plane and etcd nodes](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/#stacked-control-plane-and-etcd-nodes).

For this setup, two manifest files need to be created in `/etc/kubernetes/manifests` (create the directory first).
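For example, on each control plane node the directory can be created up front; it is the same directory into which `kubeadm` will later place its own static pod manifests:

```
# Create the static pod manifest directory if it does not exist yet.
mkdir -p /etc/kubernetes/manifests
```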
The manifest for `keepalived`, `/etc/kubernetes/manifests/keepalived.yaml`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: keepalived
  namespace: kube-system
spec:
  containers:
  - image: osixia/keepalived:2.0.20
    name: keepalived
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - NET_BROADCAST
        - NET_RAW
    volumeMounts:
    - mountPath: /usr/local/etc/keepalived/keepalived.conf
      name: config
    - mountPath: /etc/keepalived/check_apiserver.sh
      name: check
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/keepalived/keepalived.conf
    name: config
  - hostPath:
      path: /etc/keepalived/check_apiserver.sh
    name: check
status: {}
```

The manifest for `haproxy`, `/etc/kubernetes/manifests/haproxy.yaml`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: haproxy
  namespace: kube-system
spec:
  containers:
  - image: haproxy:2.8
    name: haproxy
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: localhost
        path: /healthz
        port: ${APISERVER_DEST_PORT}
        scheme: HTTPS
    volumeMounts:
    - mountPath: /usr/local/etc/haproxy/haproxy.cfg
      name: haproxyconf
      readOnly: true
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/haproxy/haproxy.cfg
      type: FileOrCreate
    name: haproxyconf
status: {}
```

Note that here again a placeholder needs to be filled in: `${APISERVER_DEST_PORT}` needs to hold the same value as in `/etc/haproxy/haproxy.cfg` (see above).

This combination has been successfully used with the versions given in the examples. Other versions might work as well or may require changes to the configuration files.

With the services up, the Kubernetes cluster can now be bootstrapped using `kubeadm init` (see [below](#bootstrap-the-cluster)).

## kube-vip

As an alternative to the more "traditional" approach of `keepalived` and `haproxy`, [kube-vip](https://kube-vip.io/) implements both management of a virtual IP and load balancing in one service. It can operate either at layer 2 (using ARP and `leaderElection`) or at layer 3, utilising BGP peering. Similar to option 2 above, `kube-vip` will be run as a static pod on the control plane nodes.

Like with `keepalived`, the hosts negotiating a virtual IP need to be in the same IP subnet. Similarly, like with `haproxy`, stream-based load balancing allows TLS termination to be handled by the API Server instances behind it.

**NOTE:** `kube-vip` requires access to the API Server, especially during cluster initialisation (the `kubeadm init` phase). At this point `admin.conf` is the only kubeconfig available for `kube-vip` to authenticate and communicate with the API Server. Once the cluster is up, it is recommended to sign a custom client kubeconfig for `kube-vip` and rotate it manually on expiration.

### Generating a Manifest

This section details creating a number of manifests for various use cases.

#### Set configuration details

```
export VIP=192.168.0.40
export INTERFACE=
```

### Configure to use a container runtime

#### Get latest version

We can parse the GitHub API to find the latest version (or we can set this manually):

`KVVERSION=$(curl -sL https://api.github.com/repos/kube-vip/kube-vip/releases | jq -r ".[0].name")`

or manually:

`export KVVERSION=vx.x.x`

The easiest method to generate a manifest is using the container itself; the commands below create a `kube-vip` alias for different container runtimes.
#### containerd

`alias kube-vip="ctr run --rm --net-host ghcr.io/kube-vip/kube-vip:$KVVERSION vip /kube-vip"`

#### Docker

`alias kube-vip="docker run --network host --rm ghcr.io/kube-vip/kube-vip:$KVVERSION"`

### ARP

This configuration will create a manifest that starts `kube-vip` providing **controlplane** and **services** management, using **leaderElection**. When this instance is elected as the leader, it will bind the `vip` to the specified `interface`; the same applies to services of `type:LoadBalancer`.

`export INTERFACE=eth0`

```
kube-vip manifest pod \
    --interface $INTERFACE \
    --vip $VIP \
    --controlplane \
    --arp \
    --leaderElection | tee /etc/kubernetes/manifests/kube-vip.yaml
```

#### Example manifest

```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: kube-vip
  namespace: kube-system
spec:
  containers:
  - args:
    - manager
    env:
    - name: vip_arp
      value: "true"
    - name: port
      value: "6443"
    - name: vip_interface
      value: ens192
    - name: vip_cidr
      value: "32"
    - name: cp_enable
      value: "true"
    - name: cp_namespace
      value: kube-system
    - name: vip_ddns
      value: "false"
    - name: vip_leaderelection
      value: "true"
    - name: vip_leaseduration
      value: "5"
    - name: vip_renewdeadline
      value: "3"
    - name: vip_retryperiod
      value: "1"
    - name: vip_address
      value: 192.168.0.40
    image: ghcr.io/kube-vip/kube-vip:v0.4.0
    imagePullPolicy: Always
    name: kube-vip
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - NET_RAW
        - SYS_TIME
    volumeMounts:
    - mountPath: /etc/kubernetes/admin.conf
      name: kubeconfig
  hostAliases:
  - hostnames:
    - kubernetes
    ip: 127.0.0.1
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/admin.conf
    name: kubeconfig
status: {}
```

### BGP

This configuration will create a manifest that starts `kube-vip` providing **controlplane** and **services** management. **Unlike** ARP, all nodes in the BGP configuration will advertise the virtual IP addresses.

**Note** we bind the address to `lo` as we don't want multiple devices that have the same address on public interfaces. All peers can be specified in a comma-separated list in the format `address:AS:password:multihop`; for example, `192.168.0.10:65000::false`, as used below, denotes a peer at `192.168.0.10` in AS `65000` with no password and multihop disabled.
`export INTERFACE=lo`

```
kube-vip manifest pod \
    --interface $INTERFACE \
    --vip $VIP \
    --controlplane \
    --bgp \
    --localAS 65000 \
    --bgpRouterID 192.168.0.2 \
    --bgppeers 192.168.0.10:65000::false,192.168.0.11:65000::false | tee /etc/kubernetes/manifests/kube-vip.yaml
```

#### Example Manifest

```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: kube-vip
  namespace: kube-system
spec:
  containers:
  - args:
    - manager
    env:
    - name: vip_arp
      value: "false"
    - name: port
      value: "6443"
    - name: vip_interface
      value: ens192
    - name: vip_cidr
      value: "32"
    - name: cp_enable
      value: "true"
    - name: cp_namespace
      value: kube-system
    - name: vip_ddns
      value: "false"
    - name: bgp_enable
      value: "true"
    - name: bgp_routerid
      value: 192.168.0.2
    - name: bgp_as
      value: "65000"
    - name: bgp_peeraddress
    - name: bgp_peerpass
    - name: bgp_peeras
      value: "65000"
    - name: bgp_peers
      value: 192.168.0.10:65000::false,192.168.0.11:65000::false
    - name: vip_address
      value: 192.168.0.40
    image: ghcr.io/kube-vip/kube-vip:v0.4.0
    imagePullPolicy: Always
    name: kube-vip
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - NET_RAW
        - SYS_TIME
    volumeMounts:
    - mountPath: /etc/kubernetes/admin.conf
      name: kubeconfig
  hostAliases:
  - hostnames:
    - kubernetes
    ip: 127.0.0.1
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/admin.conf
    name: kubeconfig
status: {}
```

With the services up, the Kubernetes cluster can now be bootstrapped using `kubeadm init` (see [below](#bootstrap-the-cluster)).

## Bootstrap the cluster

Now the actual cluster bootstrap as described in [Creating Highly Available clusters with kubeadm](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/) can take place.

Note that, if `${APISERVER_DEST_PORT}` has been configured to a value different from `6443` in the configuration above, `kubeadm init` needs to be told to use that port for the API Server. Assuming that in a new cluster port `8443` is used for the load-balanced API Server and a virtual IP with the DNS name `vip.mycluster.local` has been set up, the `--control-plane-endpoint` argument needs to be passed to `kubeadm` as follows:

```
# kubeadm init --control-plane-endpoint vip.mycluster.local:8443 [additional arguments ...]
```
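The same endpoint can also be supplied through a `kubeadm` configuration file instead of the command-line flag. A minimal sketch, assuming the `kubeadm.k8s.io/v1beta3` configuration API (adjust the version to the one supported by your `kubeadm` release):

```yaml
# kubeadm-config.yaml (hypothetical example matching the command above)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "vip.mycluster.local:8443"
```

The file would then be passed to `kubeadm init --config kubeadm-config.yaml`.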