Merge pull request #1054 from msau42/storage-topology-design
Automatic merge from submit-queue.

Volume topology aware scheduling design

Proposal for a smarter scheduler that influences PV binding. Part of kubernetes/features#121

/sig storage
/sig scheduling
/cc @kubernetes/sig-storage-proposals @kubernetes/sig-scheduling-proposals

# Volume Topology-aware Scheduling

Authors: @msau42

This document presents a detailed design for making the default Kubernetes scheduler aware of volume topology constraints, and making the PersistentVolumeClaim (PVC) binding aware of scheduling decisions.

## Goals
* Allow a Pod to request one or more topology-constrained Persistent Volumes (PVs) that are compatible with the Pod's other scheduling constraints, such as resource requirements and affinity/anti-affinity policies.
* Support arbitrary PV topology constraints (e.g. node, rack, zone, foo, bar).
* Support topology constraints for statically created PVs and dynamically provisioned PVs.
* No scheduling latency performance regression for Pods that do not use topology-constrained PVs.

## Non Goals
* Fitting a pod after the initial PVC binding has been completed.
    * The more constraints you add to your pod, the less flexible it becomes in terms of placement. Because of this, tightly constrained storage, such as local storage, is only recommended for specific use cases, and the pods should have higher priority so that they can preempt lower priority pods from the node.
* Binding decisions that consider scheduling constraints from two or more pods sharing the same PVC.
    * The scheduler itself only handles one pod at a time. It's possible that the two pods may not run at the same time either, so there's no guarantee that you will know both pods' requirements at once.
    * For two or more pods simultaneously sharing a PVC, this scenario may require an operator to schedule them together. Another alternative is to merge the two pods into one.
    * For two or more pods non-simultaneously sharing a PVC, this scenario could be handled by pod priorities and preemption.

## Problem
Volumes can have topology constraints that restrict the set of nodes that the volume can be accessed on. For example, a GCE PD can only be accessed from a single zone, and a local disk can only be accessed from a single node. In the future, there could be other topology constraints, such as rack or region.

A pod that uses such a volume must be scheduled to a node that fits within the volume's topology constraints. In addition, a pod can have further constraints and limitations, such as the pod's resource requests (cpu, memory, etc.) and pod/node affinity and anti-affinity policies.

Currently, the process of binding and provisioning volumes is done before a pod is scheduled. Therefore, it cannot take into account any of the pod's other scheduling constraints. This makes it possible for the PV controller to bind a PVC to a PV, or provision a PV, with constraints that make a pod unschedulable.

### Examples
* In multizone clusters, the PV controller has a hardcoded heuristic to provision PVCs for StatefulSets spread across zones. If that zone does not have enough cpu/memory capacity to fit the pod, then the pod is stuck in the Pending state because its volume is bound to that zone.
    * Local storage exacerbates this issue. The chance of a node not having enough cpu/memory is higher than the chance of a zone not having enough cpu/memory.
* Local storage PVC binding does not have any node spreading logic, so local PV binding will very likely conflict with any pod anti-affinity policies if there is more than one local PV on a node.
* A pod may need multiple PVCs. As an example, one PVC can point to a local SSD for fast data access, and another PVC can point to a local HDD for logging. Since PVC binding happens without considering whether multiple PVCs are related, it is very likely for the two PVCs to be bound to local disks on different nodes, making the pod unschedulable.
* For multizone clusters and deployments requesting multiple dynamically provisioned zonal PVs, each PVC is provisioned independently, and each PV is likely to be provisioned in a different zone, making the pod unschedulable.

To solve the issue of initial volume binding and provisioning causing an impossible pod placement, volume binding and provisioning should be more tightly coupled with pod scheduling.

## Background
In 1.7, we added alpha support for [local PVs](local-storage-pv) with node affinity. You can specify a PV object with node affinity, and if a pod is using such a PV, the scheduler will evaluate the PV node affinity in addition to the other scheduling predicates. So far, the PV node affinity only influences pod scheduling once the PVC is already bound. The initial PVC binding decision was unchanged. This proposal addresses the initial PVC binding decision.

## Design
The design can be broken up into a few areas:
* User-facing API to invoke new behavior
* Integrating PV binding with pod scheduling
* Binding multiple PVCs as a single transaction
* Recovery from kubelet rejection of pod
* Making dynamic provisioning topology-aware

For the alpha phase, only the user-facing API and the PV binding and scheduler integration are necessary. The remaining areas can be handled in beta and GA phases.

### User-facing API
This new binding behavior will apply to all unbound PVCs and can be controlled through install-time configuration passed to the scheduler and controller-manager.

In alpha, this configuration can be controlled by a feature gate, `VolumeTopologyScheduling`.

#### Alternative
An alternative approach is to trigger the new behavior only for volumes that have topology constraints. A new annotation can be added to the StorageClass to indicate that its volumes have constraints and will need to use the new topology-aware scheduling logic. The PV controller checks for this annotation every time it handles an unbound PVC, and will delay binding if it is set.

```
// Value is "true" or "false". Default is false.
const AlphaVolumeTopologySchedulingAnnotation = "volume.alpha.kubernetes.io/topology-scheduling"
```

While this approach would let us introduce the new behavior gradually, it has a few downsides:
* A StorageClass will be required in order to use this new logic, even if dynamic provisioning is not used (as in the case of local storage).
* We have to maintain two different paths for volume binding.
* We will be depending on the storage admin to correctly configure the StorageClasses.

For those reasons, we prefer to switch the behavior for all unbound PVCs and clearly communicate these changes to the community before changing the behavior by default in GA.

### Integrating binding with scheduling
For the alpha phase, the focus is on static provisioning of PVs to support persistent local storage.

TODO: flow chart

The proposed new workflow for volume binding is:
1. Admin statically creates PVs and/or StorageClasses.
2. User creates an unbound PVC and there are no prebound PVs for it.
3. **NEW:** PVC binding and provisioning is delayed until a pod is created that references it.
4. User creates a pod that uses the PVC.
5. Pod starts to get processed by the scheduler.
6. **NEW:** A new predicate function, called MatchUnboundPVCs, will look at all of a Pod's unbound PVCs and try to find matching PVs for that node based on the PV topology. If there are no matching PVs, then it checks if dynamic provisioning is possible for that node.
7. **NEW:** The scheduler continues to evaluate priorities. A new priority function, called PrioritizeUnboundPVCs, will get the PV matches per PVC per node, and compute a priority score based on various factors.
8. **NEW:** After evaluating all the existing predicates and priorities, the scheduler will pick a node, and call a new assume function, AssumePVCs, passing in the Node. The assume function will check if any binding or provisioning operations need to be done. If so, it will update the PV cache to mark the PVs with the chosen PVCs.
9. **NEW:** If PVC binding or provisioning is required, we do NOT AssumePod. Instead, a new bind function, BindUnboundPVCs, will be called asynchronously, passing in the selected node. The bind function will prebind the PV to the PVC, or trigger dynamic provisioning, and then wait until the PVCs are bound successfully or encounter a failure. Then it always sends the Pod through the scheduler again, for reasons explained later.
10. When a Pod makes a successful scheduler pass once all its PVCs are bound, the scheduler assumes and binds the Pod to a Node.
11. Kubelet starts the Pod.

This new workflow will have the scheduler handle unbound PVCs by choosing PVs and prebinding them to the PVCs. The PV controller completes the binding transaction, handling it as a prebound PV scenario.

Prebound PVCs and PVs will still be bound immediately by the PV controller.

One important point to note is that for the alpha phase, manual recovery is required in the following error conditions:
* A Pod has multiple PVCs, and only a subset of them successfully bind.
* The scheduler chose a PV and prebound it, but the PVC could not be bound.

In both of these scenarios, the primary cause is a user prebinding other PVCs to a PV that the scheduler chose. Some workarounds to avoid these error conditions are to:
* Prebind the PV instead.
* Separate the volumes that the user prebinds from the volumes that are available for the system to choose from, by StorageClass.

#### PV Controller Changes
When the feature gate is enabled, the PV controller needs to skip binding and provisioning all unbound PVCs that don't have prebound PVs, and let them go through the new scheduler path. This is the block in `syncUnboundClaim` where `claim.Spec.VolumeName == ""` and there is no PV that is prebound to the PVC.
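
The sketch below is a minimal, self-contained illustration of that check. It is not the actual PV controller code: the `shouldDelayBinding` helper, its parameters, and the stripped-down `PersistentVolumeClaim` type are assumptions made for readability; only `syncUnboundClaim` and `claim.Spec.VolumeName` come from the existing controller.

```
// Illustrative only: a hypothetical helper that syncUnboundClaim could call
// to decide whether to leave a PVC for the scheduler to handle.
type PersistentVolumeClaim struct {
	Spec struct {
		VolumeName string
	}
}

func shouldDelayBinding(claim *PersistentVolumeClaim, featureGateEnabled, hasPreboundPV bool) bool {
	if !featureGateEnabled {
		return false // existing behavior: bind or provision immediately
	}
	if claim.Spec.VolumeName != "" {
		return false // the user requested a specific PV; bind as before
	}
	if hasPreboundPV {
		return false // a PV is already prebound to this PVC; bind as before
	}
	// Otherwise, leave the PVC unbound and let the scheduler drive binding.
	return true
}
```
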
No other state machine changes are required. The PV controller continues to handle the remaining scenarios without any change.

The methods to find matching PVs for a claim, prebind PVs, and to invoke and handle dynamic provisioning need to be refactored for use by the new scheduler functions.

#### Scheduler Changes

##### Predicate
A new predicate function checks that all of a Pod's unbound PVCs can be satisfied by existing PVs or dynamically provisioned PVs that are topologically constrained to the Node.

```
MatchUnboundPVCs(pod *v1.Pod, node *v1.Node) (canBeBound bool, err error)
```
1. If all the Pod's PVCs are bound, return true.
2. Otherwise, try to find matching PVs for all of the unbound PVCs, in order of decreasing requested capacity.
3. Walk through all the PVs.
4. Find the best matching PV for the PVC where the PV topology is satisfied by the Node.
5. Temporarily cache this PV in the PVC object, keyed by Node, for fast processing later in the priority and bind functions.
6. Return true if all PVCs are matched.
7. If there are still unmatched PVCs, check if dynamic provisioning is possible. For this alpha phase, the provisioner is not topology-aware, so the predicate will just return true if there is a provisioner specified in the StorageClass (internal or external).
8. Otherwise return false.
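
The following sketch ties these steps together. It is illustrative only and is not the actual scheduler code: the `PVC`, `PV`, and `Node` types and every helper (`sortByRequestedCapacityDescending`, `nodeSatisfiesPVTopology`, `pvSatisfiesClaim`, `cacheMatch`, `storageClassHasProvisioner`) are assumed placeholders for the refactored PV controller methods described earlier.

```
// Illustrative sketch of the MatchUnboundPVCs flow; all helpers are placeholders.
func matchUnboundPVCs(unboundPVCs []*PVC, allPVs []*PV, node *Node) (canBeBound bool) {
	// Larger claims get first pick of the available PVs.
	sortByRequestedCapacityDescending(unboundPVCs)

	for _, pvc := range unboundPVCs {
		matched := false
		for _, pv := range allPVs {
			// Skip PVs that are already bound or prebound to another claim,
			// and PVs whose topology this node does not satisfy.
			if pv.ClaimRef != nil || !nodeSatisfiesPVTopology(pv, node) {
				continue
			}
			if pvSatisfiesClaim(pv, pvc) {
				// Cache the chosen PV per (PVC, node) so the priority and
				// bind phases can reuse it.
				cacheMatch(pvc, node, pv)
				matched = true
				break
			}
		}
		// Fall back to dynamic provisioning. In alpha the provisioner is not
		// topology-aware, so any configured provisioner is accepted.
		if !matched && !storageClassHasProvisioner(pvc) {
			return false
		}
	}
	return true
}
```
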

TODO: caching format and details

##### Priority
After all the predicates run, there is a reduced set of Nodes that can fit a Pod. A new priority function will rank the remaining nodes based on the unbound PVCs and their matching PVs.

```
PrioritizeUnboundPVCs(pod *v1.Pod, filteredNodes HostPriorityList) (rankedNodes HostPriorityList, err error)
```
1. For each Node, get the cached PV matches for the Pod's PVCs.
2. Compute a priority score for the Node using the following factors:
    1. How close the PVC's requested capacity and the PV's capacity are.
    2. Matching static PVs is preferred over dynamic provisioning, because we assume that the administrator has specifically created these PVs for the Pod.
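
To make these factors concrete, here is one possible scoring shape. The weights and exact calculation are still an open TODO (noted below), so every constant, type, and field name in this sketch is an assumption rather than the proposed formula.

```
// Illustrative scoring only; weights and ranges are placeholders.
type match struct {
	requestedBytes          int64
	pvCapacityBytes         int64
	usesDynamicProvisioning bool
}

func scoreNode(matches []match) int {
	score := 0
	for _, m := range matches {
		if m.usesDynamicProvisioning || m.pvCapacityBytes == 0 {
			// Static PV matches are preferred, so provisioned claims add nothing.
			continue
		}
		// Reward close capacity matches: a near-exact fit scores higher than
		// binding to a PV much larger than the request.
		utilization := float64(m.requestedBytes) / float64(m.pvCapacityBytes)
		score += int(utilization * 10) // contributes 0..10 per claim
	}
	return score
}
```
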

TODO (beta): figure out weights and exact calculation

##### Assume
Once all the predicates and priorities have run, the scheduler picks a Node. Then we can bind or provision PVCs for that Node. For better scheduler performance, we'll assume that the binding will likely succeed, and update the PV cache first. The actual binding API update will then be made asynchronously, and the scheduler can continue processing other Pods.

For the alpha phase, the AssumePVCs function will be directly called by the scheduler. We'll consider creating a generic scheduler interface in a subsequent phase.

```
AssumePVCs(pod *v1.Pod, node *v1.Node) (pvcBindingRequired bool, err error)
```
1. If all the Pod's PVCs are bound, return false.
2. For static PV binding:
    1. Get the cached matching PVs for the PVCs on that Node.
    2. Validate the actual PV state.
    3. Mark PV.ClaimRef in the PV cache.
3. For in-tree and external dynamic provisioning:
    1. Nothing.
4. Return true.
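
A minimal sketch of this assume step is shown below, under the assumption of hypothetical `PVC`, `PV`, `Node`, and `pvCache` types; `cachedMatch` and `AssumeClaimRef` are placeholders, not actual scheduler cache APIs, and imports are omitted as in the other snippets in this document.

```
// Illustrative sketch of AssumePVCs; cache types and helpers are placeholders.
func assumePVCs(unboundPVCs []*PVC, node *Node, cache *pvCache) (pvcBindingRequired bool, err error) {
	if len(unboundPVCs) == 0 {
		return false, nil // all PVCs already bound; nothing to assume
	}
	for _, pvc := range unboundPVCs {
		pv := cachedMatch(pvc, node)
		if pv == nil {
			continue // no static match; dynamic provisioning handles this PVC
		}
		// Re-validate the PV against the latest cache state before assuming.
		if pv.ClaimRef != nil {
			return false, errors.New("PV was bound to another claim since the predicate ran")
		}
		// Point the PV at the PVC in the scheduler's cache only; the API
		// update happens asynchronously in the bind step.
		cache.AssumeClaimRef(pv, pvc)
	}
	return true, nil
}
```
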

##### Bind

If AssumePVCs returns pvcBindingRequired, then the BindUnboundPVCs function is called as a goroutine. Otherwise, we can continue with assuming and binding the Pod to the Node.

For the alpha phase, the BindUnboundPVCs function will be directly called by the scheduler. We'll consider creating a generic scheduler interface in a subsequent phase.

```
BindUnboundPVCs(pod *v1.Pod, node *v1.Node) (err error)
```
1. For static PV binding:
    1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
    2. If the prebind fails, revert the cache updates.
    3. Otherwise, wait until the PVCs are bound, the PVC/PV objects are deleted, or the PV.ClaimRef field is cleared.
2. For in-tree dynamic provisioning:
    1. Make the provision call, which will create a new PV object that is prebound to the PVC.
    2. If provisioning is successful, wait until the PVCs are bound, the PVC/PV objects are deleted, or the PV.ClaimRef field is cleared.
3. For external provisioning:
    1. Set the annotation for external provisioners.
4. Send the Pod back through scheduling, regardless of success or failure.
    1. In the case of success, we need one more pass through the scheduler in order to evaluate the other volume predicates that require the PVC to be bound, as described below.
    2. In the case of failure, we want to retry binding/provisioning.
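
The sketch below strings these cases together into a single asynchronous bind routine. It is illustrative only: `unboundPVCsOf`, `cachedMatch`, `prebindPV`, `revertAssumedCache`, `provisionFor`, `waitForPVCsBoundOrFailed`, and `requeuePod` are assumed placeholders for the refactored PV controller methods and the scheduler queue, not existing functions.

```
// Illustrative sketch of the asynchronous bind step; all helpers are placeholders.
func bindUnboundPVCs(pod *Pod, node *Node) error {
	for _, pvc := range unboundPVCsOf(pod) {
		if pv := cachedMatch(pvc, node); pv != nil {
			// Static binding: prebind by setting PV.ClaimRef through the API.
			if err := prebindPV(pv, pvc); err != nil {
				revertAssumedCache(pv)
				return err
			}
		} else if err := provisionFor(pvc, node); err != nil {
			// Dynamic provisioning creates a PV prebound to the PVC (in-tree),
			// or sets the annotation for external provisioners.
			return err
		}
	}
	// Wait for binding to complete (external provisioning is not waited on),
	// then always send the Pod through scheduling again.
	waitForPVCsBoundOrFailed(pod)
	requeuePod(pod)
	return nil
}
```
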

Note that for external provisioning, we do not wait for the PVCs to be bound, so the Pod will be sent through scheduling repeatedly until the PVCs are bound. This is because there is no function call for external provisioning, so if we did wait, we could be waiting forever for the PVC to bind. It's also possible that in the meantime, a user creates a PV that satisfies the PVC, so the external provisioner is no longer needed.

TODO: The PV controller has a high resync frequency; do we need something similar for the scheduler too?

##### Pod preemption considerations
The MatchUnboundPVCs predicate does not need to be re-evaluated for pod preemption. Preempting a pod that uses a PV will not free up capacity on that node, because the PV lifecycle is independent of the Pod's lifecycle.

##### Other scheduler predicates
Currently, there are a few existing scheduler predicates that require the PVC to be bound. That bound assumption needs to be changed in order to work with this new workflow.

TODO: How do we handle the race condition of PVCs becoming bound in the middle of running the predicates? One possible way is to mark, at the beginning of scheduling a Pod, whether all of its PVCs were bound. Then we can check if a second scheduler pass is needed.

###### Max PD Volume Count Predicate
This predicate checks that the maximum number of PDs per node is not exceeded. It needs to be integrated into the binding decision so that we don't bind or provision a PV if doing so would cause the node to exceed the max PD limit. Until it is integrated, we need to make one more pass in the scheduler after all the PVCs are bound. The current copy of the predicate in the default scheduler has to remain in order to account for the already-bound volumes.

###### Volume Zone Predicate
This predicate makes sure that the zone label on a PV matches the zone label of the node. If the volume is not bound, this predicate can be ignored, as the binding logic will take into account zone constraints on the PV.

However, this assumes that zonal PVs like GCE PDs and AWS EBS have been updated to use the new PV topology specification, which is not the case as of 1.8. So until those plugins are updated, the binding and provisioning decisions will be topology-unaware, and we need to make one more pass in the scheduler after all the PVCs are bound.

This predicate needs to remain in the default scheduler to handle the already-bound volumes using the old zonal labeling. It can be removed once that mechanism is deprecated and unsupported.

###### Volume Node Predicate
This is a new predicate added in 1.7 to handle the new PV node affinity. It evaluates the node affinity against the node's labels to determine if the pod can be scheduled on that node. If the volume is not bound, this predicate can be ignored, as the binding logic will take into account the PV node affinity.
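
A minimal sketch of the resulting skip-if-unbound pattern for these volume predicates is shown below; the predicate shape and all helper names (`pvcsOf`, `pvFor`, `nodeSatisfiesPVTopology`) are illustrative assumptions, not the existing predicate code.

```
// Illustrative pattern for volume predicates under the new workflow.
func volumeNodeAffinityPredicate(pod *Pod, node *Node) bool {
	for _, pvc := range pvcsOf(pod) {
		if !pvc.IsBound() {
			// Unbound PVCs are handled by the new binding logic, which already
			// accounts for PV node affinity, so skip them here.
			continue
		}
		if !nodeSatisfiesPVTopology(pvFor(pvc), node) {
			return false
		}
	}
	return true
}
```
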

#### Performance and Optimizations

Let:
* N = number of nodes
* V = number of all PVs
* C = number of claims in a pod

C is expected to be very small (< 5), so it shouldn't be a significant factor.

The current PV binding mechanism just walks through all the PVs once, so its running time is O(V).

Without any optimizations, the new PV binding mechanism has to run through all PVs for every node, so its running time is O(NV).

A few optimizations can be made to improve the performance:

1. Optimizing for PVs that don't use node affinity (to prevent a performance regression):
    1. Index the PVs by StorageClass and only search the PV list with the matching StorageClass.
    2. Keep temporary state in the PVC cache recording whether we previously succeeded or failed to match PVs, and whether none of the PVs have node affinity. If none do, we can skip PV matching on subsequent nodes and just return the result of the first attempt.
2. Optimizing for PVs that have node affinity:
    1. When a static PV is created, if node affinity is present, evaluate it against all the nodes. For each node, keep an in-memory map of all its PVs keyed by StorageClass. When finding matching PVs for a particular node, try to match against the PVs in the node's PV map instead of the cluster-wide PV list.
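
The sketch below illustrates the kind of index these optimizations imply. The data structures and helper names (`pvIndex`, `nodeSatisfiesPVTopology`, `HasNodeAffinity`) are assumptions for illustration; the real cache layout is left to the beta implementation.

```
// Illustrative caches for the optimizations above; real structures are TBD.
// Maps are assumed to be initialized by a constructor.
type pvIndex struct {
	// byClass lets the matcher search only PVs of the claim's StorageClass.
	byClass map[string][]*PV
	// byNode holds, for each node, the node-affine PVs that the node
	// satisfies, again keyed by StorageClass.
	byNode map[string]map[string][]*PV
}

// addPV is called when a static PV is created.
func (idx *pvIndex) addPV(pv *PV, allNodes []*Node) {
	class := pv.StorageClassName
	idx.byClass[class] = append(idx.byClass[class], pv)
	if !pv.HasNodeAffinity() {
		return
	}
	// Evaluate node affinity once, up front, instead of on every scheduling pass.
	for _, node := range allNodes {
		if nodeSatisfiesPVTopology(pv, node) {
			if idx.byNode[node.Name] == nil {
				idx.byNode[node.Name] = map[string][]*PV{}
			}
			idx.byNode[node.Name][class] = append(idx.byNode[node.Name][class], pv)
		}
	}
}
```
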

For the alpha phase, the optimizations are not required. However, they should be required for beta and GA.

#### Packaging
The new bind logic that is invoked by the scheduler can be packaged in a few ways:
* As a library to be directly called in the default scheduler
* As a scheduler extender

We propose taking the library approach, as this method is simplest to release and deploy. Some downsides are:
* The binding logic will be executed using two different caches, one in the scheduler process and one in the PV controller process. There is the potential for more race conditions due to the caches being out of sync.
* Refactoring the binding logic into a common library is more challenging because the scheduler's cache and the PV controller's cache have different interfaces and private methods.

##### Extender cons
However, the cons of the extender approach outweigh the cons of the library approach.

With an extender approach, the PV controller could implement the scheduler extender HTTP endpoint, and the advantage is that the binding logic triggered by the scheduler can share the same caches and state as the PV controller.

However, deployment of this scheduler extender in an HA master configuration is extremely complex. The scheduler has to be configured with the hostname or IP of the PV controller. In an HA setup, the active scheduler and active PV controller could run on the same node or on different nodes, and the node can change at any time. Exposing a network endpoint in the controller-manager process is unprecedented, and there would be many additional features required, such as adding a mechanism to get a stable network name, adding authorization and access control, and dealing with DDoS attacks and other potential security issues. Adding to those challenges is the fact that there are countless ways for users to deploy Kubernetes.

With all this complexity, the library approach is the most feasible in a single release time frame, and it aligns better with the current Kubernetes architecture.

#### Downsides/Problems

##### Expanding Scheduler Scope

This approach expands the role of the Kubernetes default scheduler to also schedule PVCs, and not just Pods. This breaks some fundamental design assumptions in the scheduler, primarily that if `Pod.Spec.NodeName` is set, then the Pod is scheduled. Users can set the `NodeName` field directly, and controllers like DaemonSets also currently bypass the scheduler.

For this approach to be fully functional, we need to change the scheduling criteria to also include unbound PVCs.

TODO: details

A general extension mechanism is needed to support scheduling and binding other objects in the future.

##### Impact on Custom Schedulers
This approach is going to make it harder to run custom schedulers, controllers, and operators that use PVCs and PVs. It adds a requirement that schedulers also make the PV binding decision.

There are a few ways to mitigate the impact:
* Custom schedulers could be implemented through the scheduler extender interface. This allows the default scheduler to be run in addition to the custom scheduling logic.
* The new code for this implementation will be packaged as a library to make it easier for custom schedulers to include in their own implementation.
* Ample notice of feature/behavioral deprecation will be given, with at least the amount of time defined in the Kubernetes deprecation policy.

In general, many advanced scheduling features have been added into the default scheduler, such that it is becoming more difficult to run custom schedulers with the new features.

##### HA Master Upgrades
HA masters add a bit of complexity to this design because the active scheduler process and the active controller-manager (PV controller) process can be on different nodes. That means that during an HA master upgrade, the scheduler and controller-manager can be on different versions.

The scenario where the scheduler is newer than the PV controller is fine. PV binding will not be delayed, and in successful scenarios, all PVCs will be bound before coming to the scheduler.

However, if the PV controller is newer than the scheduler, then PV binding will be delayed, and the scheduler does not have the logic to choose and prebind PVs. That will cause PVCs to remain unbound and the Pod to remain unschedulable.

TODO: One way to solve this is to have a new mechanism to feature-gate system components based on versions, so that the new feature is not turned on until all dependencies are at the required versions.

For alpha, this is not a concern, but it needs to be solved by GA, when the new functionality is enabled by default.

#### Other Alternatives Considered

##### One scheduler function

An alternative design considered was to do the predicate, priority, and bind functions all in one function at the end, right before Pod binding, in order to reduce the number of passes we have to make over all the PVs. However, this design does not work well with pod preemption. Pod preemption needs to be able to evaluate whether evicting a lower priority Pod will make a higher priority Pod schedulable, and it does this by re-evaluating predicates without the lower priority Pod.

If we had put the MatchUnboundPVCs predicate at the end, then pod preemption wouldn't have an accurate filtered node list, and could end up preempting pods on a Node that the higher priority pod still cannot run on due to its PVC requirements. For that reason, the PVC binding decision needs to have its predicate function separated out and evaluated with the rest of the predicates.

##### Pull entire PVC binding into the scheduler
The proposed design only has the scheduler initiating the binding transaction by prebinding the PV. An alternative is to pull the whole two-way binding transaction into the scheduler, but there are some complex scenarios that the scheduler's Pod sync loop cannot handle:
* PVC and PV getting unexpectedly unbound or lost
* PVC and PV state getting partially updated
* PVC and PV deletion and cleanup

Handling these scenarios in the scheduler's Pod sync loop is not possible, so they have to remain in the PV controller.

##### Keep all PVC binding in the PV controller
Instead of initiating PV binding in the scheduler, have the PV controller wait until the Pod has been scheduled to a Node, and then try to bind based on the chosen Node. A new scheduling predicate is still needed to filter and match the PVs (but not actually bind them).

The advantages are:
* Existing scenarios where the scheduler is bypassed will work the same way.
* Custom schedulers will continue to work the same way.
* Most of the PV logic is still contained in the PV controller, simplifying HA upgrades.

Major downsides of this approach include:
* It requires the PV controller to watch Pods and potentially change its sync loop to operate on pods, in order to handle the multiple-PVCs-in-a-pod scenario. This is a potentially big change that would be hard to keep separate and feature-gated from the current PV logic.
* Both the scheduler and PV controller processes have to make the binding decision, but because they do so asynchronously, it is possible for them to choose different PVs. The scheduler has to cache its decision so that it won't choose the same PV for another PVC. But by the time the PV controller handles that PVC, it could choose a different PV than the scheduler did.
* Recovering from this inconsistent decision and syncing the two caches is very difficult. The scheduler could have made a cascading sequence of decisions based on the first inconsistent decision, and they would all have to somehow be fixed based on the real PVC/PV state.
* If the scheduler process restarts, it loses all its in-memory PV decisions and can make many wrong decisions after the restart.
* All the volume scheduler predicates that require the PVC to be bound will not get evaluated. To solve this, all the volume predicates would also need to be built into the PV controller when matching possible PVs.

##### Move PVC binding to kubelet
Looking into the future, with the potential for NUMA-aware scheduling, you could have a sub-scheduler on each node to handle the pod scheduling within a node. It could make sense to have the volume binding as part of this sub-scheduler, to make sure that the volume selected will have NUMA affinity with the rest of the resources that the pod requested.

However, there are potential security concerns, because kubelet would need to see unbound PVs in order to bind them. For local storage, the PVs could be restricted to just that node, but for zonal storage, it could see all the PVs in that zone.

In addition, the sub-scheduler is just a thought at this point, and there are no concrete proposals in this area yet.

### Binding multiple PVCs in one transaction
TODO (beta): More details

For the alpha phase, this is not required. Since the scheduler is serialized, a partial binding failure should be a rare occurrence. Manual recovery will need to be done for those rare failures.

One approach to handle this is to roll back previously bound PVCs using PV taints and tolerations. The details will be handled as a separate feature.

For rollback, PersistentVolumes will have a new status to indicate whether they are clean or dirty. For backwards compatibility, a nil value is treated as dirty. The PV controller will set the status to clean if the PV is Available and unbound. Kubelet will set the PV status to dirty during Pod admission, before adding the volume to the desired state.

Two new PV taints are available for the scheduler to use:
* ScheduleFailClean, with a short default toleration.
* ScheduleFailDirty, with an infinite default toleration.

If scheduling fails, update all bound PVs and add the appropriate taint depending on whether the PV is clean or dirty, then retry scheduling. It could be that the reason for the scheduling failure clears within the toleration period, and the PVCs are able to be bound and scheduled.

If all PVs are bound and have a ScheduleFail taint, but the toleration has not been exceeded, then remove the ScheduleFail taints.

If all PVs are bound and none have ScheduleFail taints, then continue to schedule the pod.
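
Purely as an illustration of the clean/dirty decision, a helper along these lines could pick the taint key; the `pvStatus` type, the nil-means-dirty default handling, and the function itself are assumptions pending the separate rollback feature design.

```
// Illustrative only: choosing which taint to apply on scheduling failure.
// The taint keys come from this proposal; everything else is a placeholder.
type pvStatus struct {
	Clean *bool // nil is treated as dirty for backwards compatibility
}

func taintKeyForScheduleFailure(status pvStatus) string {
	if status.Clean != nil && *status.Clean {
		return "ScheduleFailClean" // short default toleration
	}
	return "ScheduleFailDirty" // infinite default toleration
}
```
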

### Recovering from kubelet rejection of pod

TODO (beta): More details

Again, for the alpha phase, manual intervention will be required.

For beta, we can use the same rollback mechanism as above to handle this case. If kubelet rejects a pod, it will go back to scheduling. If the scheduler cannot find a node for the pod, then it will encounter a scheduling failure and taint the PVs with the ScheduleFail taint.

### Making dynamic provisioning topology aware
TODO (beta): Design details

For alpha, we are not focusing on this use case, but it should be able to follow the new workflow closely with some modifications:
* The MatchUnboundPVCs predicate function needs to get provisionable capacity per topology dimension from the provisioner somehow.
* The PrioritizeUnboundPVCs priority function can add a new priority score factor based on the available capacity per node.
* The BindUnboundPVCs bind function needs to pass the node to the provisioner. The internal and external provisioning APIs need to be updated to take a node parameter.
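
As a rough illustration of that last point, the provisioning call might grow a node parameter along these lines; the interface and type names are assumptions, since the real API change is part of the beta design.

```
// Illustrative only: a provisioning interface extended with a node parameter.
type TopologyAwareProvisioner interface {
	// Provision creates a volume whose topology is compatible with the given node.
	Provision(claim *PVC, node *Node) (*PV, error)
}
```
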

## Testing

### E2E tests
* StatefulSet, replicas=3, specifying pod anti-affinity
    * Positive: Local PVs on each of the nodes
    * Negative: Local PVs only on 2 out of the 3 nodes
* StatefulSet specifying pod affinity
    * Positive: Multiple local PVs on a node
    * Negative: Only one local PV available per node
* Multiple PVCs specified in a pod
    * Positive: Enough local PVs available on a single node
    * Negative: Not enough local PVs available on a single node
* Fallback to dynamic provisioning if the static PVs are unsuitable
* Existing PV tests need to be modified, if alpha is enabled, to check that the PVC is bound after the pod has started.
* Existing dynamic provisioning tests may need to be disabled for alpha if they don't launch pods, or extended to launch pods.
* Existing prebinding tests should still work.

### Unit tests
* All PVCs found a match on the first node. Verify the match is best suited based on capacity.
* All PVCs found a match on the second node. Verify the match is best suited based on capacity.
* Only 2 out of 3 PVCs have a match.
* Priority scoring doesn't change the given priorityList order.
* Priority scoring changes the priorityList order.
* Don't match PVs that are prebound.


## Implementation Plan

### Alpha
* New feature gate for volume topology scheduling
* Refactor PV controller methods into a common library
* PV controller: Disable binding of unbound PVCs if the feature gate is set
* Predicate: Filter nodes and find matching PVs
* Predicate: Check if a provisioner exists for dynamic provisioning
* Update existing predicates to skip unbound PVCs
* Bind: Trigger PV binding
* Bind: Trigger dynamic provisioning
* Tests: Refactor all the tests that expect a PVC to be bound before scheduling a Pod (only if alpha is enabled)

### Beta
* Scheduler cache: Optimizations for PVs without node affinity
* Priority: Capacity match score
* Bind: Handle partial binding failures
* Plugins: Convert all zonal volume plugins to use the new PV node affinity (GCE PD, AWS EBS, what else?)
* Make dynamic provisioning topology aware

### GA
* Predicate: Handle max PD per node limit
* Scheduler cache: Optimizations for PV node affinity
* Handle kubelet rejection of pod

## Open Issues
* Can the generic device resource API be leveraged at all? Probably not, because:
    * It will only work for local storage (node-specific devices), and not zonal storage.
    * Storage already has its own first-class resources in K8s (PVC/PV) with an independent lifecycle. The current resource API proposal does not have a way to specify identity/persistence for devices.
* Will this be able to work with the node sub-scheduler design for NUMA-aware scheduling?
    * It's still in a very early discussion phase.