a) The current implementation use's a static kubelet which doesn't not conform to the Node authorization mode (i.e. system:nodes:<nodename>)
b) As present the kubeconfig is static and reused across all the masters and nodes
The PR firstly introduces the ability for users to use bootstrap tokens and secondly when enabled ensure the kubelets for the masters as have unique usernames. Note, this PR does not attempt to address the distribution of the bootstrap tokens themselves, that's for cluster admins. One solution for this would be a daemonset on the masters running on hostNetwork and reuse dns-controller to annotated the pods and give as the DNS
Notes:
- the master node do not use bootstrap tokens, instead given they have access to the ca anyhow, we generate certificates for each.
- when bootstrap token is not enabled the behaviour will stay the same; i.e. a kubelet configuration brought down from the store.
- when bootstrap tokens are enabled, the Nodes sit in a timeout loop waiting for the configuration to appear (by third party).
- given the nodeup docker and manifests builders are executed before the kubelet builder, the assumption here is a unit file kicks of a custom container to bootstrap the rest.
- the current firewalls on between the master and nodes are fairly open so no need to open ports between the two
- much of the work was ported from @justinsb PR [here](https://github.com/kubernetes/kops/pull/4134/)
- we add a very presumptuous server and client certificates for use with an authorizer (node-bootstrap-internal.dns_zone)
I do have an additional PR which performs the entire thing. The process being a node_authorizer which runs on the master nodes via a daemonset, the service implements a series of authorizers (i.e. alwaysallow, aws, gce etc). For aws, the process is similar to how vault authorizes nodes [here](https://www.vaultproject.io/docs/auth/aws.html). Nodeup no then calls out to the node_authorizer on bootstrap and provisions the kubelet.
Kubernetes 1.9 changed the default for etcd-quorum-read flag value to
true, in the hope of fixing some of the edge-case controller issues.
However, while this is cheap on etcd3, that fix was not backported to
etcd2, and performance there of quorum reads is poor.
For non-HA clusters with etcd2, it still goes through raft, but does not
need to - we set etcd-quorum-read to false, as this is just a missed
optimization in etcd2.
For HA clusters with etcd2, it's trickier, but at least for now we're
going to avoid the (crippling) performance regression. kops 1.10 should
have etcd-manager (allowing upgrades to etcd3), and the ability to
configure IOPS on the etcd volume, so we can revisit this in 1.10 /
1.11.
Automatic merge from submit-queue.
Initial aggregation work
Create the keypairs, which are supposed to be signed by a different CA.
Set the `--requestheader-...` flags on apiserver.
Fix#3152Fix#2691
- removing the StorageType on the etcd cluster spec (sticking with the Version field only)
- changed the protokube flag back to -etcd-image
- users have to explicitly set the etcd version now; the latest version in gcr.io is 3.0.17
- reverted the ordering on the populate spec
The current implementation is running v2.2.1 which is two year old and end of life. This PR add the ability to use etcd and set the versions if required. Note at the moment the image is still using the gcr.io registry image. As note, much like TLS their presently is not 'automated' migration path from v2 to v3.
- the feature is gated behine the storageType of the etcd cluster, bot clusters events and main must use the same storage type
- the version for v2 is unchanged and pinned at v2.2.1 with v2 using v3.0.17
- @question: we shoudl consider allowing the use to override the images though I think this should be addresses more generically, than one offs here and then. I know chris is working on a asset registry??
User reports of kubelet flags not being passed; moved more to code.
Also found & fixed the likely root-cause issue: we have two copies of
the cluster spec and were not being precise about which one we wanted to
use at all times.
It looks like it got lost in a refactor. Add a unit test, and move
initialization to code (and have the code self-check as well).
Also we can now have a fairly long code comment about the reasons why
this is such a mess...
Fix#371