# Volume Topology-aware Scheduling
Authors: @msau42, @lichuqiang
This document presents a detailed design for making the default Kubernetes
scheduler aware of volume topology constraints, and making the
PersistentVolumeClaim (PVC) binding aware of scheduling decisions.
## Definitions
* Topology: Rules to describe accessibility of an object with respect to
location in a cluster.
* Domain: A grouping of locations within a cluster. For example, 'node1',
'rack10', 'zone5'.
* Topology Key: A description of a general class of domains. For example,
'node', 'rack', 'zone'.
* Hierarchical domain: Domain that can be fully encompassed in a larger domain.
For example, the 'zone1' domain can be fully encompassed in the 'region1'
domain.
* Failover domain: A domain that a workload intends to run in at a later time.
## Goals
* Allow a Pod to request one or more Persistent Volumes (PV) with topology that
are compatible with the Pod's other scheduling constraints, such as resource
requirements and affinity/anti-affinity policies.
* Support arbitrary PV topology domains (e.g. node, rack, zone, foo, bar).
* Support topology for statically created PVs and dynamically provisioned PVs.
* No scheduling latency performance regression for Pods that do not use
PVs with topology.
* Allow administrators to restrict allowed topologies per StorageClass.
* Allow storage providers to report capacity limits per topology domain.
## Non Goals
* Fitting a pod after the initial PVC binding has been completed.
* For two+ pods simultaneously sharing a PVC, the user can use a pod affinity
operator to schedule them together. Another alternative is to merge the two
pods into one.
* For two+ pods non-simultaneously sharing a PVC, this scenario could be
handled by pod priorities and preemption.
* Provisioning multi-domain volumes where all the domains will be able to run
  the workload. For example, provisioning a multi-zonal volume and making sure
  the pod can run in all zones.
  * The scheduler cannot make decisions based on future resource requirements,
    especially if those resources can fluctuate over time. For applications that
    use such multi-domain storage, the best practice is to either:
    * Configure cluster autoscaling with enough resources to accommodate
      failing over the workload to any of the other failover domains.
    * Manually configure and overprovision the failover domains to
      accommodate the resource requirements of the workload.
* Scheduler support for volume topologies that are independent of the node's
  topologies.
  * The Kubernetes scheduler only handles topologies with respect to the
    workload and the nodes it runs on. If a storage system is deployed on an
    independent topology, it will be up to the provisioner to correctly spread
    the volumes for a workload. This could be facilitated as a separate feature
    by:
    * Passing the Pod's OwnerRef to the provisioner, and the provisioner
      spreading volumes for Pods with the same OwnerRef
    * Adding Volume Anti-Affinity policies, and passing those to the
      provisioner.
## Problem
Volumes can have topology constraints that restrict the set of nodes that the
volume can be accessed on. For example, a GCE PD can only be accessed from a
single zone, and a local disk can only be accessed from a single node. In the
future, there could be other topology domains, such as rack or region.
A pod that uses such a volume must be scheduled to a node that fits within the
volume's topology constraints. In addition, a pod can have further constraints
binding happens without considering if multiple PVCs are related, it is very likely
for the two PVCs to be bound to local disks on different nodes, making the pod
unschedulable.
* For multizone clusters and deployments requesting multiple dynamically provisioned
zonal PVs, each PVC is provisioned independently, and is likely to provision each PV
in different zones, making the pod unschedulable.
To solve the issue of initial volume binding and provisioning causing an impossible
pod placement, volume binding and provisioning should be more tightly coupled with
pod scheduling.
## Volume Topology Specification
First, volumes need a way to express topology constraints against nodes. Today, it
is done for zonal volumes by having explicit logic to process zone labels on the
PersistentVolume. However, this is not easily extendable for volumes with other
topology keys.
Instead, to support a generic specification, the PersistentVolume
object will be extended with a new NodeAffinity field that specifies the
constraints. It will closely mirror the existing NodeAffinity type used by
Pods, but we will use a new type so that we will not be bound by existing and
weights, but will not be included in the initial implementation.
The advantages of this NodeAffinity field vs the existing method of using zone labels
on the PV are:
* We don't need to expose first-class labels for every topology key.
* Implementation does not need to be updated every time a new topology key
is added to the cluster.
* NodeSelector is able to express more complex topology with ANDs and ORs.
* NodeAffinity aligns with how topology is represented with other Kubernetes
resources.
Some downsides include:
* You can have a proliferation of Node labels if you are running many different
kinds of volume plugins, each with their own topology labeling scheme.
* The NodeSelector is more expressive than what most storage providers will
need. Most storage providers only need a single topology key with
one or more domains. Non-hierarchical domains may present implementation
challenges, and it will be difficult to express all the functionality
of a NodeSelector in a non-Kubernetes specification like CSI.
### Example PVs with NodeAffinity
#### Local Volume
In this example, the volume can only be accessed from nodes that have the
label key `kubernetes.io/hostname` and label value `node-1`.
```
apiVersion: v1
kind: PersistentVolume
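# The rest of this example is a sketch: the hostname key and node-1 value come
# from the description above, while the volume name, capacity, and local path
# are illustrative assumptions.
metadata:
  name: local-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  local:
    path: /mnt/disks/vol1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1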
```
#### Zonal Volume
In this example, the volume can only be accessed from nodes that have the
label key `failure-domain.beta.kubernetes.io/zone` and label value
`us-central1-a`.
```
apiVersion: v1
kind: PersistentVolume
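# The rest of this example is a sketch modeled on the provisioning example later
# in this document; the volume name, disk name, and capacity are illustrative
# assumptions.
metadata:
  name: zonal-pv
spec:
  capacity:
    storage: 100Gi
  gcePersistentDisk:
    diskName: my-disk
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-central1-a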
```
#### Multi-Zonal Volume
In this example, the volume can only be accessed from nodes that have the
label key `failure-domain.beta.kubernetes.io/zone` and label value
`us-central1-a` OR `us-central1-b`.
```
apiVersion: v1
kind: PersistentVolume
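# The rest of this example is a sketch: only the two zone values come from the
# description above; the remaining fields are illustrative assumptions.
metadata:
  name: multi-zonal-pv
spec:
  capacity:
    storage: 100Gi
  gcePersistentDisk:
    diskName: my-regional-disk
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-central1-a
          - us-central1-b
```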
and region labels. This will handle newly created PV objects.
Existing PV objects will have to be upgraded to use the new NodeAffinity field.
This does not have to occur instantaneously, and can be updated within the
deprecation period. This can be done through a new PV update controller that
will take existing zone labels on PVs and translate that into PV.NodeAffinity.
An admission controller can validate that the labels and PV.NodeAffinity for
these existing zonal volume types are kept in sync.
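For illustration, the translation such a controller would perform might look
like the following sketch (it assumes the new PV field mirrors the core
NodeSelector types; the helper name is hypothetical, and multi-zone label values
are not handled):
```
package sketch

import v1 "k8s.io/api/core/v1"

const zoneLabel = "failure-domain.beta.kubernetes.io/zone"

// zoneLabelToNodeAffinity builds a node affinity requirement equivalent to the
// legacy zone label on a PV.
func zoneLabelToNodeAffinity(pv *v1.PersistentVolume) *v1.VolumeNodeAffinity {
    zone, ok := pv.Labels[zoneLabel]
    if !ok {
        return nil
    }
    return &v1.VolumeNodeAffinity{
        Required: &v1.NodeSelector{
            NodeSelectorTerms: []v1.NodeSelectorTerm{{
                MatchExpressions: []v1.NodeSelectorRequirement{{
                    Key:      zoneLabel,
                    Operator: v1.NodeSelectorOpIn,
                    Values:   []string{zone},
                }},
            }},
        },
    }
}
```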
### Bound PVC Enforcement
For PVCs that are already bound to a PV with NodeAffinity, enforcement is
Both binding decisions, selecting an existing PV and dynamically provisioning a
new one, will be considered by the scheduler, so that all of a Pod's scheduling
constraints can be evaluated at once.
The detailed design for implementing this new volume binding behavior will be
described later in the scheduler integration section.
## Delayed Volume Binding
Today, volume binding occurs immediately once a PersistentVolumeClaim is
created. In order for volume binding to take into account all of a pod's other scheduling
constraints, volume binding must be delayed until a Pod is being scheduled.
### User-facing API
A new StorageClass field `VolumeBindingMode` will be added to control the volume
binding behavior.
```
type StorageClass struct {
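    ...
    // The rest of this block is not spelled out here; the field below is a
    // sketch based on the surrounding text, and the mode values are assumptions
    // (the document names WaitForFirstConsumer and a default immediate mode).
    VolumeBindingMode *VolumeBindingMode
}

type VolumeBindingMode string

const (
    VolumeBindingImmediate VolumeBindingMode = "Immediate"
    VolumeBindingWaitForFirstConsumer VolumeBindingMode = "WaitForFirstConsumer"
)
```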
This approach allows us to introduce the new binding behavior gradually and to
be able to maintain backwards compatibility without deprecation of previous
behavior. However, it has a few downsides:
* StorageClass will be required to get the new binding behavior, even if dynamic
provisioning is not used (in the case of local storage).
* We have to maintain two different code paths for volume binding.
* We will be depending on the storage admin to correctly configure the
StorageClasses for the volume types that need the new binding behavior.
* User experience can be confusing because PVCs could have different binding
behavior depending on the StorageClass configuration. We will mitigate this by
adding a new PVC event to indicate if binding will follow the new behavior.
## Dynamic Provisioning with Topology
To make dynamic provisioning aware of pod scheduling decisions, delayed volume
binding must also be enabled. The scheduler will pass its selected node to the
dynamic provisioner, and the provisioner will create a volume in the topology
domain that the selected node is part of. The domain depends on the volume
plugin. Zonal volume plugins will create the volume in the zone that the
selected node is in. The local volume plugin will create the volume on the
selected node.
### End to End Zonal Example
This is an example of the most common use case for provisioning zonal volumes.
For this use case, the user's specs are unchanged. Only one change
to the StorageClass is needed to enable delayed volume binding.
1. Admin sets up a StorageClass with delayed volume binding enabled.
```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/gce-pd
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: pd-standard
```
2. Admin launches provisioner. For in-tree plugins, nothing needs to be done.
3. User creates PVC. Nothing changes in the spec, although now the PVC won't be
immediately bound.
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  storageClassName: standard
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```
4. User creates Pod. Nothing changes in the spec.
```
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  ...
  volumes:
  - name: my-vol
    persistentVolumeClaim:
      claimName: my-pvc
```
5. Scheduler picks a node that can satisfy the Pod and passes it to the
provisioner.
6. Provisioner dynamically provisions a PV that can be accessed from
that node.
```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: volume-1
spec:
  capacity:
    storage: 100Gi
  storageClassName: standard
  gcePersistentDisk:
    diskName: my-disk
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-central1-a
```
7. Pod gets scheduled to the node.
### Restricting Topology
For the common use case, volumes will be provisioned in whatever topology domain
the scheduler has decided is best to run the workload. Users may impose further
restrictions by setting label/node selectors, and pod affinity/anti-affinity
policies on their Pods. All those policies will be taken into account when
dynamically provisioning a volume.
While less common, administrators may want to further restrict what topology
domains are available to a StorageClass. To support these administrator
policies, an AllowedTopology field of type VolumeProvisioningTopology can also
be specified in the StorageClass to restrict the allowed topology domains for
dynamic provisioning.
This is not expected to be a common use case, and there are some caveats,
described below.
```
type StorageClass struct {
    ...
    AllowedTopology *VolumeProvisioningTopology
}

type VolumeProvisioningTopology struct {
    Required map[string]VolumeTopologyValues
}

type VolumeTopologyValues struct {
    Values []string
}
```
A nil value means there are no topology restrictions. A scheduler predicate
will evaluate a non-nil value when considering dynamic provisioning for a node.
When evaluating the specified topology against a node, the node must have a
label for each topology key specified and one of its allowed values. For
example, a topology specified with "zone: [us-central1-a, us-central1-b]" means
that the node must have the label "zone: us-central1-a" or "zone:
us-central1-b".
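For illustration, this per-node check could look roughly like the following
sketch (the helper name and map-based representation are assumptions, not the
actual predicate code):
```
package sketch

// nodeMatchesAllowedTopology returns true only if, for every topology key in
// the StorageClass AllowedTopology, the node has a label with that key and one
// of the allowed values.
func nodeMatchesAllowedTopology(nodeLabels map[string]string, required map[string][]string) bool {
    for key, allowedValues := range required {
        nodeValue, ok := nodeLabels[key]
        if !ok {
            // Node has no label for this topology key.
            return false
        }
        matched := false
        for _, allowed := range allowedValues {
            if allowed == nodeValue {
                matched = true
                break
            }
        }
        if !matched {
            return false
        }
    }
    return true
}
```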
A StorageClass update controller can be provided to convert existing topology
restrictions specified under plugin-specific parameters to the new AllowedTopology
field.
The AllowedTopology will also be provided to provisioners as a new field, detailed in
the provisioner section. Provisioners can use the allowed topology information
in the following scenarios:
* StorageClass is using the default immediate binding mode. This is the
legacy topology-unaware behavior. In this scenario, the volume could be
provisioned in a domain that cannot run the Pod since it doesn't take any
scheduler input.
* For volumes that span multiple domains, the AllowedTopology can restrict those
additional domains. However, special care must be taken to avoid specifying
conflicting topology constraints in the Pod. For example, the administrator could
restrict a multi-zonal volume to zones 'zone1' and 'zone2', but the Pod could have
constraints that restrict it to 'zone1' and 'zone3'. If 'zone1'
fails, the Pod cannot be scheduled to the intended failover zone.
Kubernetes will leave validation and enforcement of the AllowedTopology content up
to the provisioner.
TODO: would this be better represented as part of StorageClass quota since that
is where admins currently specify policies?
However, ResourceQuota currently only supports quantity, and topology
restrictions have to be evaluated during scheduling, not admission.
#### Zonal Example
This example restricts the volumes provisioned to zones us-central1-a and
us-central1-b.
```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-class
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
allowedTopology:
  required:
    failure-domain.beta.kubernetes.io/zone:
      values:
      - us-central1-a
      - us-central1-b
```
#### Multi-Zonal Example
This example restricts the volume's primary and failover zones
to us-central1-a and us-central1-b. Topologies that are incompatible with the
storage provider parameters will be enforced by the provisioner.
```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: multi-zonal-class
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  replicas: 2
allowedTopology:
  required:
    failure-domain.beta.kubernetes.io/zone:
      values:
      - us-central1-a
      - us-central1-b
```
### Storage Provider Capacity Reporting
Some volume types have capacity limitations per topology domain. For example,
local volumes have limited capacity per node. Provisioners for these volume
types can report these capacity limitations through a new StorageClass status
field.
```
type StorageClass struct {
    ...
    Status *StorageClassStatus
}

type StorageClassStatus struct {
    Capacity *StorageClassCapacity
}

type StorageClassCapacity struct {
    TopologyKey string
    Capacities  map[string]resource.Quantity
}
```
If no StorageClassCapacity is specified, then there are no capacity
limitations.
Capacity reporting is limited to a single topology key. `TopologyKey`
specifies the label key that represents that domain. Capacities is
a map with key equal to the label value, and value is the capacity for that
specific domain. Capacity is the total capacity available, including
capacity that has already been provisioned. A missing map entry for a topology
domain will be treated as zero capacity.
A scheduler predicate will use the reported capacity to determine if dynamic
provisioning is possible for a given node, taking into account already
provisioned capacity.
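For illustration, the per-domain capacity check could look like this sketch
(the helper and the used-capacity map are assumptions; how used capacity is
tracked is described in the caching section later):
```
package sketch

import "k8s.io/apimachinery/pkg/api/resource"

// hasCapacityForDomain reports whether the domain's reported total capacity,
// minus capacity already provisioned in that domain, can hold the new claim.
// A missing entry in the reported map is treated as zero capacity.
func hasCapacityForDomain(domain string, reported, used map[string]resource.Quantity, request resource.Quantity) bool {
    total, ok := reported[domain]
    if !ok {
        return false
    }
    available := total.DeepCopy()
    if u, ok := used[domain]; ok {
        available.Sub(u)
    }
    return available.Cmp(request) >= 0
}
```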
Downsides of this approach include:
* Multiple topology keys are not supported for capacity reporting. We have not
seen such use cases so far.
* There is no common mechanism to report capacity. Each provisioner has to
update their own fields in the StorageClass if necessary. CSI could potentially
address this.
#### Local Volume Example
This example specifies that node1 has 100Gi, and node2 has 200Gi available
for dynamic provisioning. All other nodes have 0Gi available for dynamic
provisioning.
```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: standard
...
status:
capacity:
topologyKey: kubernetes.io/hostname
capacities:
node1: 100Gi
node2: 200Gi
```
TODO: should this be in the local volume spec instead?
Some downsides specific to the local volume implementation:
* StorageClass object size can be big for large clusters. Assuming that
each node entry in the Capacities map could take up around 100 bytes
(size of node name + resource.Quantity), a 1000 node cluster that uses
dynamic provisioning of local volumes could add 100Ki to the StorageClass
object.
* If each node has a local volume provisioner, there can be many
StorageClass update conflicts if many try to update their capacity at the
same time.
* A security concern is that each local volume provisioner could potentially
modify the capacity reported by another node. This could cause
provisioning to fail if a node does not have enough capacity, or it could
cause the scheduler to skip nodes due to no capacity.
#### Relationship to StorageClassQuota
Although both the new StorageClassCapacity field and existing StorageClassQuota
manage storage consumption, they address different use cases.
The StorageClassCapacity field is actual available capacity per topology domain
reported by the storage system, while the StorageClassQuota represents policies
set by the cluster admin and does not take into account topology domain.
For example, a storage system is configured to support 200Gi in zone1 for a
StorageClass that is shared by multiple users. An administrator can use
StorageClassQuota to restrict capacity usage for each user.
## Feature Gates
PersistentVolume.NodeAffinity and StorageClass.VolumeBindingMode fields will be
controlled by the VolumeScheduling feature gate, and must be configured in the
kube-scheduler, kube-controller-manager, and all kubelets.
StorageClass.AllowedTopology and StorageClassCapacity fields will be controlled
by the DynamicProvisioningScheduling feature gate, and must be configured in the
kube-scheduler and kube-controller-manager.
## Integrating volume binding with pod scheduling
For the new volume binding mode, the proposed new workflow is:
1. Admin statically creates PVs and/or StorageClasses.
2. User creates unbound PVC and there are no prebound PVs for it.
3. **NEW:** PVC binding and provisioning is delayed until a pod is created that
references it.
4. User creates a pod that uses the PVC.
5. Pod starts to get processed by the scheduler.
6. **NEW:** A new predicate function, called CheckVolumeBinding, will process
both bound and unbound PVCs of the Pod. It will validate the VolumeNodeAffinity
for bound PVCs. For unbound PVCs, it will try to find matching PVs for that node
based on the PV NodeAffinity. If there are no matching PVs, then it checks if
dynamic provisioning is possible for that node based on StorageClass
AllowedTopologies and Capacity.
7. **NEW:** The scheduler continues to evaluate priorities. A new priority
function, called PrioritizeVolumes, will get the PV matches per PVC per
node, and compute a priority score based on various factors.
8. **NEW:** After evaluating all the existing predicates and priorities, the
scheduler will pick a node, and call a new assume function, AssumeVolumes,
passing in the Node. The assume function will check if any binding or
provisioning operations need to be done. If so, it will update the PV cache to
mark the PVs with the chosen PVCs and queue the Pod for volume binding.
9. **NEW:** If PVC binding or provisioning is required, we do NOT AssumePod.
Instead, a new bind function, BindVolumes, will be called asynchronously, passing
in the selected node. The bind function will prebind the PV to the PVC, or
trigger dynamic provisioning. Then, it always sends the Pod through the
scheduler again for reasons explained later.
avoid these error conditions are to:
* Separate out volumes that the user prebinds from the volumes that are
available for the system to choose from by StorageClass.
### PV Controller Changes
When the feature gate is enabled, the PV controller needs to skip binding
unbound PVCs with VolumeBindingWaitForFirstConsumer and no prebound PVs
so that they go through the scheduler path.
Dynamic provisioning will also be skipped if
VolumeBindingWaitForFirstConsumer is set. The scheduler will signal to
the PV controller to start dynamic provisioning by setting the
`annSelectedNode` annotation in the PVC. If provisioning fails, the PV
controller can signal back to the scheduler to retry dynamic provisioning by
removing the `annSelectedNode` annotation.
No other state machine changes are required. The PV controller continues to
handle the remaining scenarios without any change.
The methods to find matching PVs for a claim and prebind PVs need to be
refactored for use by the new scheduler functions.
### Dynamic Provisioning interface changes
The dynamic provisioning interfaces will be updated to pass in:
* selectedNode, when late binding is enabled on the StorageClass
* allowedTopologies, when it is set in the StorageClass
If selectedNode is set, the provisioner should get its appropriate topology
labels from the Node object, and provision a volume based on those topology
values. In the common use case for a volume supporting a single topology domain,
if selectedNode is set, then allowedTopologies can be ignored.
In-tree provisioners:
```
Provision(selectedNode *string, allowedTopologies *storagev1.VolumeProvisioningTopology) (*v1.PersistentVolume, error)
```
External provisioners:
* selectedNode will be represented by the PVC annotation "volume.alpha.kubernetes.io/selectedNode"
* allowedTopologies must be obtained by looking at the StorageClass for the PVC.
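For illustration, an external zonal provisioner might resolve the topology
domain from the selected node roughly as follows (a sketch; the lister wiring,
helper name, and error handling are simplified assumptions):
```
package sketch

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    corelisters "k8s.io/client-go/listers/core/v1"
)

// selectedZone returns the zone to provision in, based on the node the
// scheduler selected for the first consuming Pod.
func selectedZone(pvc *v1.PersistentVolumeClaim, nodes corelisters.NodeLister) (string, error) {
    nodeName, ok := pvc.Annotations["volume.alpha.kubernetes.io/selectedNode"]
    if !ok {
        // No selected node: either late binding is not enabled on the
        // StorageClass, or the scheduler has not processed the Pod yet.
        return "", fmt.Errorf("no selected node for claim %s", pvc.Name)
    }
    node, err := nodes.Get(nodeName)
    if err != nil {
        return "", err
    }
    return node.Labels["failure-domain.beta.kubernetes.io/zone"], nil
}
```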
#### New Permissions
Provisioners will need to be able to get Node and StorageClass objects.
### Scheduler Changes
#### Predicate
A new predicate function checks that all of a Pod's unbound PVCs can be satisfied
by existing PVs or dynamically provisioned PVs that are
topologically-constrained to the Node.
```
CheckVolumeBinding(pod *v1.Pod, node *v1.Node) (canBeBound bool, err error)
```
1. If all the Pod's PVCs are bound, return true.
2. Otherwise try to find matching PVs for all of the unbound PVCs in order of
decreasing requested capacity.
5. Temporarily cache this PV choice for the PVC per Node, for fast
processing later in the priority and bind functions.
6. Return true if all PVCs are matched.
7. If there are still unmatched PVCs, check if dynamic provisioning is possible,
by evaluating StorageClass.AllowedTopology, and StorageClassCapacity. If so,
temporarily cache this decision in the PVC per Node.
8. Otherwise return false.
#### Priority
After all the predicates run, there is a reduced set of Nodes that can fit a
Pod. A new priority function will rank the remaining nodes based on the
unbound PVCs and their matching PVs.
```
PrioritizeVolumes(pod *v1.Pod, filteredNodes HostPriorityList) (rankedNodes HostPriorityList, err error)
```
1. For each Node, get the cached PV matches for the Pod's PVCs.
2. Compute a priority score for the Node using the following factors:
TODO (beta): figure out weights and exact calculation
#### Assume
Once all the predicates and priorities have run, then the scheduler picks a
Node. Then we can bind or provision PVCs for that Node. For better scheduler
performance, we'll assume that the binding will likely succeed, and update the
PV and PVC caches first. Then the actual binding API update will be made
asynchronously, and the scheduler can continue processing other Pods.
For the alpha phase, the AssumeVolumes function will be directly called by the
scheduler. We'll consider creating a generic scheduler interface in a
subsequent phase.
```
AssumeVolumes(pod *v1.Pod, node *v1.Node) (pvcBindingRequired bool, err error)
```
1. If all the Pod's PVCs are bound, return false.
2. For static PV binding:
    3. Mark PV.ClaimRef in the PV cache.
    4. Cache the PVs that need binding in the Pod object.
3. For in-tree and external dynamic provisioning:
    1. Mark the PVC annSelectedNode in the PVC cache.
    2. If the StorageClass has reported capacity limits, cache the topologyKey
       value in the PVC annProvisionedTopology, and update the StorageClass
       capacity cache.
    3. Cache the PVCs that need provisioning in the Pod object.
4. Return true.
#### Bind
If AssumeVolumes returns pvcBindingRequired, then the BindVolumes function is called
as a go routine. Otherwise, we can continue with assuming and binding the Pod
to the Node.
For the alpha phase, the BindVolumes function will be directly called by the
scheduler. We'll consider creating a generic scheduler interface in a subsequent
phase.
```
BindVolumes(pod *v1.Pod, node *v1.Node) (err error)
```
1. For static PV binding:
    1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
    2. If the prebind fails, revert the cache updates.
2. For in-tree and external dynamic provisioning:
    1. Set `annSelectedNode` on the PVC.
3. Send Pod back through scheduling, regardless of success or failure.
    1. In the case of success, we need one more pass through the scheduler in
       order to evaluate other volume predicates that require the PVC to be bound, as
       described below.
TODO: pv controller has a high resync frequency, do we need something similar
for the scheduler too
#### Access Control
Scheduler will need PV update permissions for prebinding static PVs, and PVC
update permissions for triggering dynamic provisioning.
#### Pod preemption considerations
The CheckVolumeBinding predicate does not need to be re-evaluated for pod
preemption. Preempting a pod that uses a PV will not free up capacity on that
node because the PV lifecycle is independent of the Pod's lifecycle.
#### Other scheduler predicates
Currently, there are a few existing scheduler predicates that require the PVC
to be bound. The bound assumption needs to be changed in order to work with
this new workflow.
running predicates? One possible way is to mark at the beginning of scheduling
a Pod if all PVCs were bound. Then we can check if a second scheduler pass is
needed.
##### Max PD Volume Count Predicate
This predicate checks that the maximum number of PDs per node is not exceeded. It
needs to be integrated into the binding decision so that we don't bind or
provision a PV if it's going to cause the node to exceed the max PD limit. But
until it is integrated, we need to make one more pass in the scheduler after all
the PVCs are bound. The current copy of the predicate in the default scheduler
has to remain to account for the already-bound volumes.
##### Volume Zone Predicate
This predicate makes sure that the zone label on a PV matches the zone label of
the node. If the volume is not bound, this predicate can be ignored, as the
binding logic will take into account zone constraints on the PV.
This predicate needs to remain in the default scheduler to handle the
already-bound volumes using the old zonal labeling. It can be removed once that
mechanism is deprecated and unsupported.
##### Volume Node Predicate
This is a new predicate added in 1.7 to handle the new PV node affinity. It
evaluates the node affinity against the node's labels to determine if the pod
can be scheduled on that node. If the volume is not bound, this predicate can
be ignored, as the binding logic will take into account the PV node affinity.
#### Caching
There are three new caches needed in the scheduler.
The first cache is for handling the PV/PVC API binding updates occurring
asynchronously with the main scheduler loop. `AssumeVolumes` needs to store
the updated API objects before `BindVolumes` makes the API update, so
that future binding decisions will not choose any assumed PVs. In addition,
if the API update fails, the cached updates need to be reverted and restored
with the actual API object. The cache will return either the cached-only
all the volume predicates are fully run once all PVCs are bound.
* Caching PV matches per node decisions that the predicate had made. This is
an optimization to avoid walking through all the PVs again in priority and
assume functions.
* Caching PVC dynamic provisioning decisions per node that the predicate had
made.
The third cache is for storing provisioned capacity per topology domain for
those StorageClasses that report capacity limits. Since the capacity reported in
the StorageClass is total capacity, we need to calculate used capacity by adding
up the capacities of all the PVCs provisioned for that StorageClass in each
domain. We can cache these used capacities so we don't have to recalculate them
every time in the predicate.
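For illustration, deriving the used capacity could look like the following
sketch (the annotation key is hypothetical; the document only refers to the
annotation as annProvisionedTopology):
```
package sketch

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// Hypothetical annotation key recording which topology domain a PVC was
// provisioned (or assumed) into.
const annProvisionedTopology = "volume.alpha.kubernetes.io/provisionedTopology"

// usedCapacityByDomain sums the storage requests of all PVCs provisioned
// against the given StorageClass, grouped by topology domain.
func usedCapacityByDomain(pvcs []*v1.PersistentVolumeClaim, className string) map[string]resource.Quantity {
    used := map[string]resource.Quantity{}
    for _, pvc := range pvcs {
        if pvc.Spec.StorageClassName == nil || *pvc.Spec.StorageClassName != className {
            continue
        }
        domain, ok := pvc.Annotations[annProvisionedTopology]
        if !ok {
            continue
        }
        request := pvc.Spec.Resources.Requests[v1.ResourceStorage]
        total := used[domain]
        total.Add(request)
        used[domain] = total
    }
    return used
}
```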
#### Performance and Optimizations
Let:
match against the PVs in the node's PV map instead of the cluster-wide PV list
For the alpha phase, the optimizations are not required. However, they should
be required for beta and GA.
### Packaging
The new bind logic that is invoked by the scheduler can be packaged in a few
ways:
* As a library to be directly called in the default scheduler
for more race conditions due to the caches being out of sync.
because the scheduler's cache and PV controller's cache have different interfaces
and private methods.
#### Extender cons
However, the cons of the extender approach outweigh the cons of the library
approach.
Kubernetes.
With all this complexity, the library approach is the most feasible in a single
release time frame, and aligns better with the current Kubernetes architecture.
### Downsides
#### Unsupported Use Cases
The following use cases will not be supported for PVCs with a StorageClass with
VolumeBindingWaitForFirstConsumer:
* Directly setting Pod.Spec.NodeName
These two use cases will bypass the default scheduler and thus will not
trigger PV binding.
#### Custom Schedulers
Custom schedulers, controllers and operators that handle pod scheduling and want
to support this new volume binding mode will also need to handle the volume
binding decision.
easier for custom schedulers to include in their own implementation.
In general, many advanced scheduling features have been added into the default
scheduler, such that it is becoming more difficult to run without it.
#### HA Master Upgrades
HA masters add a bit of complexity to this design because the active scheduler
process and active controller-manager (PV controller) process can be on different
nodes. That means during an HA master upgrade, the scheduler and controller-manager
all dependencies are at the required versions.
For alpha, this is not concerning, but it needs to be solved by GA.
### Other Alternatives Considered
#### One scheduler function
An alternative design considered was to do the predicate, priority and bind
functions all in one function at the end right before Pod binding, in order to
reduce the number of passes we have to make over all the PVs. However, this
on a Node that the higher priority pod still cannot run on due to PVC
requirements. For that reason, the PVC binding decision needs to have its
predicate function separated out and evaluated with the rest of the predicates.
#### Pull entire PVC binding into the scheduler
The proposed design only has the scheduler initiating the binding transaction
by prebinding the PV. An alternative is to pull the whole two-way binding
transaction into the scheduler, but there are some complex scenarios that
scheduler's Pod sync loop cannot handle:
Handling these scenarios in the scheduler's Pod sync loop is not possible, so
they have to remain in the PV controller.
#### Keep all PVC binding in the PV controller
Instead of initiating PV binding in the scheduler, have the PV controller wait
until the Pod has been scheduled to a Node, and then try to bind based on the
chosen Node. A new scheduling predicate is still needed to filter and match
can make a lot of wrong decisions after the restart.
evaluated. To solve this, all the volume predicates need to also be built into
the PV controller when matching possible PVs.
#### Move PVC binding to kubelet
Looking into the future, with the potential for NUMA-aware scheduling, you could
have a sub-scheduler on each node to handle the pod scheduling within a node. It
could make sense to have the volume binding as part of this sub-scheduler, to make
to just that node, but for zonal storage, it could see all the PVs in that zone.
In addition, the sub-scheduler is just a thought at this point, and there are no
concrete proposals in this area yet.
## Binding multiple PVCs in one transaction
There are no plans to handle this, but a possible solution is presented here if the
need arises in the future. Since the scheduler is serialized, a partial binding
failure should be a rare occurrence and would only be caused if there is a user or
If scheduling fails, update all bound PVCs with an annotation,
are clean. Scheduler and kubelet need to reject pods with PVCs that are
undergoing rollback.
## Recovering from kubelet rejection of pod
We can use the same rollback mechanism as above to handle this case.
If kubelet rejects a pod, it will go back to scheduling. If the scheduler
cannot find a node for the pod, then it will encounter scheduling failure and
initiate the rollback.
## Testing