Updates: requisite and preferred topology generation; upgrade considerations; statefulset volume spreading
This commit is contained in:
parent
74b121faba
commit
7302f0a734
|
@ -91,15 +91,21 @@ CSI volume drivers should create a socket at the following path on the node mach
|
|||
|
||||
Upon initialization of the external “CSI volume driver”, kubelet must call the CSI method `NodeGetInfo` to get the mapping from Kubernetes Node names to CSI driver NodeID and the associated `accessible_topology`. It must:
|
||||
|
||||
* Create/update a new `CSINodeInfo` object instance for the node with the NodeID and topology keys from `accessible_topology`.
|
||||
* Create/update a `CSINodeInfo` object instance for the node with the NodeID and topology keys from `accessible_topology`.
|
||||
* This will enable the component that will issue `ControllerPublishVolume` calls to use the `CSINodeInfo` as a mapping from cluster node ID to storage node ID.
|
||||
* This will enable the component that will issue `CreateVolume` to reconstruct `accessible_topology` and provision a volume that is accesible from specific node.
|
||||
* Each driver must completely overwrite its previous version of NodeID and topology keys, if they exist.
|
||||
* If the `NodeGetInfo` call fails, kubelet must delete any previous NodeID and topology keys for this driver.
|
||||
* When kubelet plugin unregistration mechanism is implemented, delete NodeID and topology keys when a driver is unregistered.
|
||||
|
||||
* Update Node API object with the CSI driver NodeID as the `csi.volume.kubernetes.io/nodeid` annotation. The value of the annotation is a JSON blob, containing key/value pairs for each CSI driver. For example:
|
||||
```
|
||||
csi.volume.kubernetes.io/nodeid: "{ \"driver1\": \"name1\", \"driver2\": \"name2\" }
|
||||
```
|
||||
This annotation is deprecated and will be removed according to deprecation policy (1 year after deprecation). TODO mark deprecation date.
|
||||
|
||||
*This annotation is deprecated and will be removed according to deprecation policy (1 year after deprecation). TODO mark deprecation date.*
|
||||
* If the `NodeGetInfo` call fails, kubelet must delete any previous NodeID for this driver.
|
||||
* When kubelet plugin unregistration mechanism is implemented, delete NodeID and topology keys when a driver is unregistered.
|
||||
|
||||
* Create/update Node API object with `accessible_topology` as labels.
|
||||
There are no hard restrictions on the label format, but for the format to be used by the recommended setup, please refer to [Topology Representation in Node Objects](#topology-representation-in-node-objects).
|
||||
|
@ -155,10 +161,13 @@ In short, to dynamically provision a new CSI volume, a cluster admin would creat
|
|||
|
||||
To provision a new CSI volume, an end user would create a `PersistentVolumeClaim` object referencing this `StorageClass`. The external provisioner will react to the creation of the PVC and issue the `CreateVolume` call against the CSI volume driver to provision the volume. The `CreateVolume` name will be auto-generated as it is for other dynamically provisioned volumes. The `CreateVolume` capacity will be taken from the `PersistentVolumeClaim` object. The `CreateVolume` parameters will be passed through from the `StorageClass` parameters (opaque to Kubernetes).
|
||||
|
||||
If the `PersistentVolumeClaim` has the `volume.alpha.kubernetes.io/selected-node` annotation set (only added if delayed volume binding is enabled in the `StorageClass`), the provisioner will get relevant topology keys from the corresponding `CSINodeInfo` instance and the topology values from `Node` labels and pass them to the `CreateVolume` call as preferred topology. `AllowedTopologies` from the `StorageClass` is passed through as requisite topology. Before calling `CreateVolume`, the provisioner will also validate `AllowedTopologies` against a cache of all known topology values in the cluster, where the cache is populated by a Node informer. If `AllowedTopologies` is unspecified, the provisioner will pass in all topology values as requisite topology.
|
||||
If the `PersistentVolumeClaim` has the `volume.alpha.kubernetes.io/selected-node` annotation set (only added if delayed volume binding is enabled in the `StorageClass`), the provisioner will get relevant topology keys from the corresponding `CSINodeInfo` instance and the topology values from `Node` labels and use them to generate preferred topology in the `CreateVolume()` request. If the annotation is unset, preferred topology will not be specified (unless the PVC follows StatefulSet naming format, discussed later in this section). `AllowedTopologies` from the `StorageClass` is passed through as requisite topology. If `AllowedTopologies` is unspecified, the provisioner will pass in a set of aggregated topology values across the whole cluster as requisite topology.
|
||||
|
||||
**TODO** (verult) discuss topology aggregation strategy
|
||||
**TODO** (verult) discuss passing multiple preferred topology segments
|
||||
To perform this topology aggregation, the external provisioner will cache all existing Node objects. In order to prevent a compromised node from affecting the provisioning process, it will pick a single node as the source of truth for keys, instead of relying on keys stored in `CSINodeInfo` for each node object. For PVCs to be provisioned with late binding, the selected node is the source of truth; otherwise a random node is picked. The provisioner will then iterate through all cached nodes that contain a node ID from the driver, aggregating labels using those keys. Note that if topology keys are different across the cluster, only a subset of nodes matching the topology keys of the chosen node will be considered for provisioning.
|
||||
|
||||
To generate preferred topology, the external provisioner will generate N segments for preferred topology in the `CreateVolume()` call, where N is the size of requisite topology. Multiple segments are included to support volumes that are available across multiple topological segments. The topology segment from the selected node will always be the first in preferred topology. All other segments are some reordering of remaining requisite topologies such that given a requisite topology (or any arbitrary reordering of it) and a selected node, the set of preferred topology is guaranteed to always be the same.
|
||||
|
||||
If immediate volume binding mode is set and the PVC follows StatefulSet naming format, then the provisioner will choose, as the first segment in preferred topology, a segment from requisite topology based on the PVC name that ensures an even spread of topology across the StatefulSet's volumes. The logic will be similar to the name hashing logic inside the GCE Persistent Disk provisioner. Other segments in preferred topology are ordered the same way as described above. This feature will be flag-gated in the external provisioner provided as part of the recommended deployment method.
|
||||
|
||||
Once the operation completes successfully, the external provisioner creates a `PersistentVolume` object to represent the volume using the information returned in the `CreateVolume` response. The topology of the returned volume is translated to the `PersistentVolume` `NodeAffinity` field. The `PersistentVolume` object is then bound to the `PersistentVolumeClaim` and available for use.
|
||||
|
||||
|
@ -455,13 +464,14 @@ The list of topology keys known to the driver is stored separately in the `CSINo
|
|||
|
||||
Justifications:
|
||||
* No strange separators needed, comparing to the alternative. Cleaner format.
|
||||
* The same topology key could be used across different sources (different storage plugin, network plugin, etc.)
|
||||
* The same topology key could be used across different components (different storage plugin, network plugin, etc.)
|
||||
* Once NodeRestriction is moved to the newer model (see [here](https://github.com/kubernetes/community/pull/911) for context), for each new label prefix introduced in a new driver, the cluster admin has to configure NodeRestrictions to allow the driver to update labels with the prefix. Cluster installations could include certain prefixes for pre-installed drivers by default. This is less convenient compared to the alternative, which can allow editing of all CSI drivers by default using the “csi.kubernetes.io” prefix, but often times cluster admins have to whitelist those prefixes anyway (for example ‘cloud.google.com’)
|
||||
|
||||
Considerations:
|
||||
* Upon driver deletion/upgrade/downgrade, stale labels will be left untouched. It’s difficult for the driver to decide whether other sources rely on this label.
|
||||
* During driver installation/upgrade/downgrade, controller deployment must be brought down before node deployment, and node deployment must be deployed before the controller deployment, because provisioning relies on up-to-date node information.
|
||||
* Topology keys inside `CSINodeInfo` must reflect the topology keys from drivers currently installed on the node. If no driver is installed, the collection must be empty.
|
||||
* Upon driver deletion/upgrade/downgrade, stale labels will be left untouched. It’s difficult for the driver to decide whether other components outside CSI rely on this label.
|
||||
* During driver installation/upgrade/downgrade, controller deployment must be brought down before node deployment, and node deployment must be deployed before the controller deployment, because provisioning relies on up-to-date node information. One possible issue is if only topology values change while keys remain the same, and if AllowedTopologies is not specified, requisite topology will contain both old and new topology values, and CSI driver may fail the CreateVolume() call. Given that CSI driver should be backward compatible, this is more of an issue when a node rolling upgrade happens before the controller update. It's not an issue if keys are changed as well since requisite and preferred topology generation handles it appropriately.
|
||||
* During driver installation/upgrade/downgrade, if a version of the controller (either old or new) is running while there is an ongoing rolling upgrade with the node deployment, and the new version of the CSI driver reports different topology information, nodes in the cluster may have different versions of topology information. However, this doesn't pose an issue. If AllowedTopologies is specified, a subset of nodes matching the version of topology information in AllowedTopologies will be used as provisioning candidate. If AllowedTopologies is not specified, a single node is used as the source of truth for keys
|
||||
* Topology keys inside `CSINodeInfo` must reflect the topology keys from drivers currently installed on the node. If no driver is installed, the collection must be empty. However, due to the possible race condition between kubelet (the writer) and the external provisioner (the reader), the provisioner must gracefully handle the case where `CSINodeInfo` is not up-to-date. In the current design, the provisioner will erroneously provision a volume on a node where it's inaccessible.
|
||||
|
||||
Alternative:
|
||||
1. `"csi.kubernetes.io/topology.example.com_rack": "rack1"`
|
||||
|
@ -557,8 +567,10 @@ Form 3 - Reduced by `zone`.
|
|||
```
|
||||
The provisioner will always choose Form 1, i.e. all `values` will have at most 1 element. Reduction logic could be added in future versions to arbitrarily choose a valid and simpler form like Forms 2 & 3.
|
||||
|
||||
#### Upgrades & Downgrades
|
||||
TODO verult
|
||||
#### Upgrade & Downgrade Considerations
|
||||
When drivers are uninstalled, topology information stored in Node labels remain untouched. The recommended label format allows multiple sources (such as CSI, networking resources, etc.) to share the same label key, so it's nontrivial to accurately determine whether a label is still used.
|
||||
|
||||
In order to upgrade drivers using the recommended driver deployment mechanism, the user is recommended to tear down the StatefulSet (controller components) before the DaemonSet (node components), and deploy the DaemonSet before the StatefulSet. There may be design improvements to eliminate this constraint, but it will be evaluated at a later iteration.
|
||||
|
||||
### Example Walkthrough
|
||||
|
||||
|
|
Loading…
Reference in New Issue