Merge pull request #1054 from msau42/storage-topology-design
Automatic merge from submit-queue.

Volume topology aware scheduling design

Proposal for a smarter scheduler that influences PV binding. Part of kubernetes/features#121

/sig storage
/sig scheduling
/cc @kubernetes/sig-storage-proposals @kubernetes/sig-scheduling-proposals

# Volume Topology-aware Scheduling

Authors: @msau42

This document presents a detailed design for making the default Kubernetes scheduler aware of volume topology constraints, and making the PersistentVolumeClaim (PVC) binding aware of scheduling decisions.

## Goals
* Allow a Pod to request one or more topology-constrained Persistent Volumes (PVs) that are compatible with the Pod's other scheduling constraints, such as resource requirements and affinity/anti-affinity policies.
* Support arbitrary PV topology constraints (e.g. node, rack, zone, foo, bar).
* Support topology constraints for statically created PVs and dynamically provisioned PVs.
* No scheduling latency performance regression for Pods that do not use topology-constrained PVs.

## Non Goals
* Fitting a pod after the initial PVC binding has been completed.
    * The more constraints you add to your pod, the less flexible it becomes in terms of placement. Because of this, tightly constrained storage, such as local storage, is only recommended for specific use cases, and the pods should have higher priority so that they can preempt lower priority pods from the node.
* Binding decisions that consider scheduling constraints from two or more pods sharing the same PVC.
    * The scheduler itself only handles one pod at a time. It's possible that the two pods may not run at the same time either, so there's no guarantee that you will know both pods' requirements at once.
    * For two or more pods simultaneously sharing a PVC, this scenario may require an operator to schedule them together. Another alternative is to merge the two pods into one.
    * For two or more pods non-simultaneously sharing a PVC, this scenario could be handled by pod priorities and preemption.

## Problem
Volumes can have topology constraints that restrict the set of nodes that the volume can be accessed on. For example, a GCE PD can only be accessed from a single zone, and a local disk can only be accessed from a single node. In the future, there could be other topology constraints, such as rack or region.

A pod that uses such a volume must be scheduled to a node that fits within the volume's topology constraints. In addition, a pod can have further constraints and limitations, such as the pod's resource requests (cpu, memory, etc.) and pod/node affinity and anti-affinity policies.

Currently, the process of binding and provisioning volumes is done before a pod is scheduled. Therefore, it cannot take into account any of the pod's other scheduling constraints. This makes it possible for the PV controller to bind a PVC to a PV, or provision a PV, with constraints that make a pod unschedulable.

### Examples
* In multizone clusters, the PV controller has a hardcoded heuristic to provision PVCs for StatefulSets spread across zones. If that zone does not have enough cpu/memory capacity to fit the pod, then the pod is stuck in the Pending state because its volume is bound to that zone.
    * Local storage exacerbates this issue. The chance of a node not having enough cpu/memory is higher than the chance of a zone not having enough cpu/memory.
* Local storage PVC binding does not have any node spreading logic, so local PV binding will very likely conflict with any pod anti-affinity policies if there is more than one local PV on a node.
* A pod may need multiple PVCs. As an example, one PVC can point to a local SSD for fast data access, and another PVC can point to a local HDD for logging. Since PVC binding happens without considering whether multiple PVCs are related, it is very likely for the two PVCs to be bound to local disks on different nodes, making the pod unschedulable.
* For multizone clusters and deployments requesting multiple dynamically provisioned zonal PVs, each PVC is provisioned independently, and each PV is likely to be provisioned in a different zone, making the pod unschedulable.

To solve the issue of initial volume binding and provisioning causing an impossible pod placement, volume binding and provisioning should be more tightly coupled with pod scheduling.

## Background
In 1.7, we added alpha support for [local PVs](local-storage-pv) with node affinity. You can specify a PV object with node affinity, and if a pod is using such a PV, the scheduler will evaluate the PV node affinity in addition to the other scheduling predicates. So far, the PV node affinity only influences pod scheduling once the PVC is already bound. The initial PVC binding decision was unchanged. This proposal addresses the initial PVC binding decision.

## Design
The design can be broken up into a few areas:
* User-facing API to invoke new behavior
* Integrating PV binding with pod scheduling
* Binding multiple PVCs as a single transaction
* Recovery from kubelet rejection of pod
* Making dynamic provisioning topology-aware

For the alpha phase, only the user-facing API and the PV binding and scheduler integration are necessary. The remaining areas can be handled in beta and GA phases.

### User-facing API
This new binding behavior will apply to all unbound PVCs and can be controlled through install-time configuration passed to the scheduler and controller-manager.

In alpha, this configuration can be controlled by a feature gate, `VolumeTopologyScheduling`.

#### Alternative
An alternative approach is to trigger the new behavior only for volumes that have topology constraints. A new annotation can be added to the StorageClass to indicate that its volumes have constraints and will need to use the new topology-aware scheduling logic. The PV controller checks for this annotation every time it handles an unbound PVC, and will delay binding if it is set.

```
// Value is "true" or "false". Default is false.
const AlphaVolumeTopologySchedulingAnnotation = "volume.alpha.kubernetes.io/topology-scheduling"
```

While this approach would let us introduce the new behavior gradually, it has a few downsides:
* A StorageClass will be required in order to use this new logic, even if dynamic provisioning is not used (as in the case of local storage).
* We have to maintain two different paths for volume binding.
* We will be depending on the storage admin to correctly configure the StorageClasses.

For those reasons, we prefer to switch the behavior for all unbound PVCs and clearly communicate these changes to the community before changing the behavior by default in GA.

### Integrating binding with scheduling
For the alpha phase, the focus is on static provisioning of PVs to support persistent local storage.

TODO: flow chart

The proposed new workflow for volume binding is:
1. Admin statically creates PVs and/or StorageClasses.
2. User creates an unbound PVC and there are no prebound PVs for it.
3. **NEW:** PVC binding and provisioning is delayed until a pod is created that references it.
4. User creates a pod that uses the PVC.
5. Pod starts to get processed by the scheduler.
6. **NEW:** A new predicate function, called MatchUnboundPVCs, will look at all of a Pod's unbound PVCs and try to find matching PVs for that node based on the PV topology. If there are no matching PVs, then it checks if dynamic provisioning is possible for that node.
7. **NEW:** The scheduler continues to evaluate priorities. A new priority function, called PrioritizeUnboundPVCs, will get the PV matches per PVC per node, and compute a priority score based on various factors.
8. **NEW:** After evaluating all the existing predicates and priorities, the scheduler will pick a node, and call a new assume function, AssumePVCs, passing in the Node. The assume function will check if any binding or provisioning operations need to be done. If so, it will update the PV cache to mark the PVs with the chosen PVCs.
9. **NEW:** If PVC binding or provisioning is required, we do NOT AssumePod. Instead, a new bind function, BindUnboundPVCs, will be called asynchronously, passing in the selected node. The bind function will prebind the PV to the PVC, or trigger dynamic provisioning, and then wait until the PVCs are bound successfully or encounter a failure. Then it always sends the Pod through the scheduler again, for reasons explained later.
10. When a Pod makes a successful scheduler pass once all its PVCs are bound, the scheduler assumes and binds the Pod to a Node.
11. Kubelet starts the Pod.

This new workflow will have the scheduler handle unbound PVCs by choosing PVs and prebinding them to the PVCs. The PV controller completes the binding transaction, handling it as a prebound PV scenario.

Prebound PVCs and PVs will still be bound immediately by the PV controller.

One important point to note is that for the alpha phase, manual recovery is required in the following error conditions:
* A Pod has multiple PVCs, and only a subset of them successfully bind.
* The scheduler chose a PV and prebound it, but the PVC could not be bound.

In both of these scenarios, the primary cause is a user prebinding other PVCs to a PV that the scheduler chose. Some workarounds to avoid these error conditions are to:
* Prebind the PV instead.
* Separate the volumes that the user prebinds from the volumes that are available for the system to choose from, by StorageClass.

#### PV Controller Changes
When the feature gate is enabled, the PV controller needs to skip binding and provisioning all unbound PVCs that don't have prebound PVs, and let them go through the new scheduler path. This is the block in `syncUnboundClaim` where `claim.Spec.VolumeName == ""` and there is no PV that is prebound to the PVC.
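
The sketch below is a minimal, self-contained illustration of that check. It is not the actual PV controller code: the `shouldDelayBinding` helper, its parameters, and the stripped-down `PersistentVolumeClaim` type are assumptions made for readability; only `syncUnboundClaim` and `claim.Spec.VolumeName` come from the existing controller.

```
// Illustrative only: a hypothetical helper that syncUnboundClaim could call
// to decide whether to leave a PVC for the scheduler to handle.
type PersistentVolumeClaim struct {
	Spec struct {
		VolumeName string
	}
}

func shouldDelayBinding(claim *PersistentVolumeClaim, featureGateEnabled, hasPreboundPV bool) bool {
	if !featureGateEnabled {
		return false // existing behavior: bind or provision immediately
	}
	if claim.Spec.VolumeName != "" {
		return false // the user requested a specific PV; bind as before
	}
	if hasPreboundPV {
		return false // a PV is already prebound to this PVC; bind as before
	}
	// Otherwise, leave the PVC unbound and let the scheduler drive binding.
	return true
}
```
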
No other state machine changes are required. The PV controller continues to handle the remaining scenarios without any change.

The methods to find matching PVs for a claim, prebind PVs, and to invoke and handle dynamic provisioning need to be refactored for use by the new scheduler functions.

#### Scheduler Changes

##### Predicate
A new predicate function checks that all of a Pod's unbound PVCs can be satisfied by existing PVs or dynamically provisioned PVs that are topologically constrained to the Node.

```
MatchUnboundPVCs(pod *v1.Pod, node *v1.Node) (canBeBound bool, err error)
```
1. If all the Pod's PVCs are bound, return true.
2. Otherwise, try to find matching PVs for all of the unbound PVCs, in order of decreasing requested capacity.
3. Walk through all the PVs.
4. Find the best matching PV for the PVC where the PV topology is satisfied by the Node.
5. Temporarily cache this PV in the PVC object, keyed by Node, for fast processing later in the priority and bind functions.
6. Return true if all PVCs are matched.
7. If there are still unmatched PVCs, check if dynamic provisioning is possible. For this alpha phase, the provisioner is not topology-aware, so the predicate will just return true if there is a provisioner specified in the StorageClass (internal or external).
8. Otherwise return false.
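
The following sketch ties these steps together. It is illustrative only and is not the actual scheduler code: the `PVC`, `PV`, and `Node` types and every helper (`sortByRequestedCapacityDescending`, `nodeSatisfiesPVTopology`, `pvSatisfiesClaim`, `cacheMatch`, `storageClassHasProvisioner`) are assumed placeholders for the refactored PV controller methods described earlier.

```
// Illustrative sketch of the MatchUnboundPVCs flow; all helpers are placeholders.
func matchUnboundPVCs(unboundPVCs []*PVC, allPVs []*PV, node *Node) (canBeBound bool) {
	// Larger claims get first pick of the available PVs.
	sortByRequestedCapacityDescending(unboundPVCs)

	for _, pvc := range unboundPVCs {
		matched := false
		for _, pv := range allPVs {
			// Skip PVs that are already bound or prebound to another claim,
			// and PVs whose topology this node does not satisfy.
			if pv.ClaimRef != nil || !nodeSatisfiesPVTopology(pv, node) {
				continue
			}
			if pvSatisfiesClaim(pv, pvc) {
				// Cache the chosen PV per (PVC, node) so the priority and
				// bind phases can reuse it.
				cacheMatch(pvc, node, pv)
				matched = true
				break
			}
		}
		// Fall back to dynamic provisioning. In alpha the provisioner is not
		// topology-aware, so any configured provisioner is accepted.
		if !matched && !storageClassHasProvisioner(pvc) {
			return false
		}
	}
	return true
}
```
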

TODO: caching format and details

##### Priority
After all the predicates run, there is a reduced set of Nodes that can fit a Pod. A new priority function will rank the remaining nodes based on the unbound PVCs and their matching PVs.

```
PrioritizeUnboundPVCs(pod *v1.Pod, filteredNodes HostPriorityList) (rankedNodes HostPriorityList, err error)
```
1. For each Node, get the cached PV matches for the Pod's PVCs.
2. Compute a priority score for the Node using the following factors:
    1. How close the PVC's requested capacity and the PV's capacity are.
    2. Matching static PVs is preferred over dynamic provisioning, because we assume that the administrator has specifically created these PVs for the Pod.
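
To make these factors concrete, here is one possible scoring shape. The weights and exact calculation are still an open TODO (noted below), so every constant, type, and field name in this sketch is an assumption rather than the proposed formula.

```
// Illustrative scoring only; weights and ranges are placeholders.
type match struct {
	requestedBytes          int64
	pvCapacityBytes         int64
	usesDynamicProvisioning bool
}

func scoreNode(matches []match) int {
	score := 0
	for _, m := range matches {
		if m.usesDynamicProvisioning || m.pvCapacityBytes == 0 {
			// Static PV matches are preferred, so provisioned claims add nothing.
			continue
		}
		// Reward close capacity matches: a near-exact fit scores higher than
		// binding to a PV much larger than the request.
		utilization := float64(m.requestedBytes) / float64(m.pvCapacityBytes)
		score += int(utilization * 10) // contributes 0..10 per claim
	}
	return score
}
```
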

TODO (beta): figure out weights and exact calculation

##### Assume
Once all the predicates and priorities have run, the scheduler picks a Node. Then we can bind or provision PVCs for that Node. For better scheduler performance, we'll assume that the binding will likely succeed, and update the PV cache first. The actual binding API update will then be made asynchronously, and the scheduler can continue processing other Pods.

For the alpha phase, the AssumePVCs function will be directly called by the scheduler. We'll consider creating a generic scheduler interface in a subsequent phase.

```
AssumePVCs(pod *v1.Pod, node *v1.Node) (pvcBindingRequired bool, err error)
```
1. If all the Pod's PVCs are bound, return false.
2. For static PV binding:
    1. Get the cached matching PVs for the PVCs on that Node.
    2. Validate the actual PV state.
    3. Mark PV.ClaimRef in the PV cache.
3. For in-tree and external dynamic provisioning:
    1. Nothing.
4. Return true.
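
A minimal sketch of this assume step is shown below, under the assumption of hypothetical `PVC`, `PV`, `Node`, and `pvCache` types; `cachedMatch` and `AssumeClaimRef` are placeholders, not actual scheduler cache APIs, and imports are omitted as in the other snippets in this document.

```
// Illustrative sketch of AssumePVCs; cache types and helpers are placeholders.
func assumePVCs(unboundPVCs []*PVC, node *Node, cache *pvCache) (pvcBindingRequired bool, err error) {
	if len(unboundPVCs) == 0 {
		return false, nil // all PVCs already bound; nothing to assume
	}
	for _, pvc := range unboundPVCs {
		pv := cachedMatch(pvc, node)
		if pv == nil {
			continue // no static match; dynamic provisioning handles this PVC
		}
		// Re-validate the PV against the latest cache state before assuming.
		if pv.ClaimRef != nil {
			return false, errors.New("PV was bound to another claim since the predicate ran")
		}
		// Point the PV at the PVC in the scheduler's cache only; the API
		// update happens asynchronously in the bind step.
		cache.AssumeClaimRef(pv, pvc)
	}
	return true, nil
}
```
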

##### Bind

If AssumePVCs returns pvcBindingRequired, then the BindUnboundPVCs function is called as a goroutine. Otherwise, we can continue with assuming and binding the Pod to the Node.

For the alpha phase, the BindUnboundPVCs function will be directly called by the scheduler. We'll consider creating a generic scheduler interface in a subsequent phase.

```
BindUnboundPVCs(pod *v1.Pod, node *v1.Node) (err error)
```
1. For static PV binding:
    1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
    2. If the prebind fails, revert the cache updates.
    3. Otherwise, wait until the PVCs are bound, the PVC/PV objects are deleted, or the PV.ClaimRef field is cleared.
2. For in-tree dynamic provisioning:
    1. Make the provision call, which will create a new PV object that is prebound to the PVC.
    2. If provisioning is successful, wait until the PVCs are bound, the PVC/PV objects are deleted, or the PV.ClaimRef field is cleared.
3. For external provisioning:
    1. Set the annotation for external provisioners.
4. Send the Pod back through scheduling, regardless of success or failure.
    1. In the case of success, we need one more pass through the scheduler in order to evaluate the other volume predicates that require the PVC to be bound, as described below.
    2. In the case of failure, we want to retry binding/provisioning.
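
The sketch below strings these cases together into a single asynchronous bind routine. It is illustrative only: `unboundPVCsOf`, `cachedMatch`, `prebindPV`, `revertAssumedCache`, `provisionFor`, `waitForPVCsBoundOrFailed`, and `requeuePod` are assumed placeholders for the refactored PV controller methods and the scheduler queue, not existing functions.

```
// Illustrative sketch of the asynchronous bind step; all helpers are placeholders.
func bindUnboundPVCs(pod *Pod, node *Node) error {
	for _, pvc := range unboundPVCsOf(pod) {
		if pv := cachedMatch(pvc, node); pv != nil {
			// Static binding: prebind by setting PV.ClaimRef through the API.
			if err := prebindPV(pv, pvc); err != nil {
				revertAssumedCache(pv)
				return err
			}
		} else if err := provisionFor(pvc, node); err != nil {
			// Dynamic provisioning creates a PV prebound to the PVC (in-tree),
			// or sets the annotation for external provisioners.
			return err
		}
	}
	// Wait for binding to complete (external provisioning is not waited on),
	// then always send the Pod through scheduling again.
	waitForPVCsBoundOrFailed(pod)
	requeuePod(pod)
	return nil
}
```
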

Note that for external provisioning, we do not wait for the PVCs to be bound, so the Pod will be sent through scheduling repeatedly until the PVCs are bound. This is because there is no function call for external provisioning, so if we did wait, we could be waiting forever for the PVC to bind. It's also possible that in the meantime, a user creates a PV that satisfies the PVC, so the external provisioner is no longer needed.

TODO: The PV controller has a high resync frequency; do we need something similar for the scheduler too?

##### Pod preemption considerations
The MatchUnboundPVCs predicate does not need to be re-evaluated for pod preemption. Preempting a pod that uses a PV will not free up capacity on that node, because the PV lifecycle is independent of the Pod's lifecycle.

##### Other scheduler predicates
Currently, there are a few existing scheduler predicates that require the PVC to be bound. That bound assumption needs to be changed in order to work with this new workflow.

TODO: How do we handle the race condition of PVCs becoming bound in the middle of running the predicates? One possible way is to mark, at the beginning of scheduling a Pod, whether all of its PVCs were bound. Then we can check if a second scheduler pass is needed.

###### Max PD Volume Count Predicate
This predicate checks that the maximum number of PDs per node is not exceeded. It needs to be integrated into the binding decision so that we don't bind or provision a PV if doing so would cause the node to exceed the max PD limit. Until it is integrated, we need to make one more pass in the scheduler after all the PVCs are bound. The current copy of the predicate in the default scheduler has to remain in order to account for the already-bound volumes.

###### Volume Zone Predicate
This predicate makes sure that the zone label on a PV matches the zone label of the node. If the volume is not bound, this predicate can be ignored, as the binding logic will take into account zone constraints on the PV.

However, this assumes that zonal PVs like GCE PDs and AWS EBS have been updated to use the new PV topology specification, which is not the case as of 1.8. So until those plugins are updated, the binding and provisioning decisions will be topology-unaware, and we need to make one more pass in the scheduler after all the PVCs are bound.

This predicate needs to remain in the default scheduler to handle the already-bound volumes using the old zonal labeling. It can be removed once that mechanism is deprecated and unsupported.

###### Volume Node Predicate
This is a new predicate added in 1.7 to handle the new PV node affinity. It evaluates the node affinity against the node's labels to determine if the pod can be scheduled on that node. If the volume is not bound, this predicate can be ignored, as the binding logic will take into account the PV node affinity.
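
A minimal sketch of the resulting skip-if-unbound pattern for these volume predicates is shown below; the predicate shape and all helper names (`pvcsOf`, `pvFor`, `nodeSatisfiesPVTopology`) are illustrative assumptions, not the existing predicate code.

```
// Illustrative pattern for volume predicates under the new workflow.
func volumeNodeAffinityPredicate(pod *Pod, node *Node) bool {
	for _, pvc := range pvcsOf(pod) {
		if !pvc.IsBound() {
			// Unbound PVCs are handled by the new binding logic, which already
			// accounts for PV node affinity, so skip them here.
			continue
		}
		if !nodeSatisfiesPVTopology(pvFor(pvc), node) {
			return false
		}
	}
	return true
}
```
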

#### Performance and Optimizations

Let:
* N = number of nodes
* V = number of all PVs
* C = number of claims in a pod

C is expected to be very small (< 5), so it shouldn't be a significant factor.

The current PV binding mechanism just walks through all the PVs once, so its running time is O(V).

Without any optimizations, the new PV binding mechanism has to run through all PVs for every node, so its running time is O(NV).

A few optimizations can be made to improve the performance:

1. Optimizing for PVs that don't use node affinity (to prevent a performance regression):
    1. Index the PVs by StorageClass and only search the PV list with the matching StorageClass.
    2. Keep temporary state in the PVC cache recording whether we previously succeeded or failed to match PVs, and whether none of the PVs have node affinity. If none do, we can skip PV matching on subsequent nodes and just return the result of the first attempt.
2. Optimizing for PVs that have node affinity:
    1. When a static PV is created, if node affinity is present, evaluate it against all the nodes. For each node, keep an in-memory map of all its PVs keyed by StorageClass. When finding matching PVs for a particular node, try to match against the PVs in the node's PV map instead of the cluster-wide PV list.
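
The sketch below illustrates the kind of index these optimizations imply. The data structures and helper names (`pvIndex`, `nodeSatisfiesPVTopology`, `HasNodeAffinity`) are assumptions for illustration; the real cache layout is left to the beta implementation.

```
// Illustrative caches for the optimizations above; real structures are TBD.
// Maps are assumed to be initialized by a constructor.
type pvIndex struct {
	// byClass lets the matcher search only PVs of the claim's StorageClass.
	byClass map[string][]*PV
	// byNode holds, for each node, the node-affine PVs that the node
	// satisfies, again keyed by StorageClass.
	byNode map[string]map[string][]*PV
}

// addPV is called when a static PV is created.
func (idx *pvIndex) addPV(pv *PV, allNodes []*Node) {
	class := pv.StorageClassName
	idx.byClass[class] = append(idx.byClass[class], pv)
	if !pv.HasNodeAffinity() {
		return
	}
	// Evaluate node affinity once, up front, instead of on every scheduling pass.
	for _, node := range allNodes {
		if nodeSatisfiesPVTopology(pv, node) {
			if idx.byNode[node.Name] == nil {
				idx.byNode[node.Name] = map[string][]*PV{}
			}
			idx.byNode[node.Name][class] = append(idx.byNode[node.Name][class], pv)
		}
	}
}
```
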

For the alpha phase, the optimizations are not required. However, they should be required for beta and GA.

#### Packaging
The new bind logic that is invoked by the scheduler can be packaged in a few ways:
* As a library to be directly called in the default scheduler
* As a scheduler extender

We propose taking the library approach, as this method is simplest to release and deploy. Some downsides are:
* The binding logic will be executed using two different caches, one in the scheduler process and one in the PV controller process. There is the potential for more race conditions due to the caches being out of sync.
* Refactoring the binding logic into a common library is more challenging because the scheduler's cache and the PV controller's cache have different interfaces and private methods.

##### Extender cons
However, the cons of the extender approach outweigh the cons of the library approach.

With an extender approach, the PV controller could implement the scheduler extender HTTP endpoint, and the advantage is that the binding logic triggered by the scheduler can share the same caches and state as the PV controller.

However, deployment of this scheduler extender in an HA master configuration is extremely complex. The scheduler has to be configured with the hostname or IP of the PV controller. In an HA setup, the active scheduler and active PV controller could run on the same node or on different nodes, and the node can change at any time. Exposing a network endpoint in the controller-manager process is unprecedented, and there would be many additional features required, such as adding a mechanism to get a stable network name, adding authorization and access control, and dealing with DDoS attacks and other potential security issues. Adding to those challenges is the fact that there are countless ways for users to deploy Kubernetes.

With all this complexity, the library approach is the most feasible in a single release time frame, and it aligns better with the current Kubernetes architecture.

#### Downsides/Problems

##### Expanding Scheduler Scope

This approach expands the role of the Kubernetes default scheduler to also schedule PVCs, and not just Pods. This breaks some fundamental design assumptions in the scheduler, primarily that if `Pod.Spec.NodeName` is set, then the Pod is scheduled. Users can set the `NodeName` field directly, and controllers like DaemonSets also currently bypass the scheduler.

For this approach to be fully functional, we need to change the scheduling criteria to also include unbound PVCs.

TODO: details

A general extension mechanism is needed to support scheduling and binding other objects in the future.

##### Impact on Custom Schedulers
This approach is going to make it harder to run custom schedulers, controllers, and operators that use PVCs and PVs. It adds a requirement that schedulers also make the PV binding decision.

There are a few ways to mitigate the impact:
* Custom schedulers could be implemented through the scheduler extender interface. This allows the default scheduler to be run in addition to the custom scheduling logic.
* The new code for this implementation will be packaged as a library to make it easier for custom schedulers to include in their own implementation.
* Ample notice of feature/behavioral deprecation will be given, with at least the amount of time defined in the Kubernetes deprecation policy.

In general, many advanced scheduling features have been added into the default scheduler, such that it is becoming more difficult to run custom schedulers with the new features.

##### HA Master Upgrades
HA masters add a bit of complexity to this design because the active scheduler process and the active controller-manager (PV controller) process can be on different nodes. That means that during an HA master upgrade, the scheduler and controller-manager can be on different versions.

The scenario where the scheduler is newer than the PV controller is fine. PV binding will not be delayed, and in successful scenarios, all PVCs will be bound before coming to the scheduler.

However, if the PV controller is newer than the scheduler, then PV binding will be delayed, and the scheduler does not have the logic to choose and prebind PVs. That will cause PVCs to remain unbound and the Pod to remain unschedulable.

TODO: One way to solve this is to have a new mechanism to feature-gate system components based on versions, so that the new feature is not turned on until all dependencies are at the required versions.

For alpha, this is not a concern, but it needs to be solved by GA, when the new functionality is enabled by default.

#### Other Alternatives Considered

##### One scheduler function

An alternative design considered was to do the predicate, priority, and bind functions all in one function at the end, right before Pod binding, in order to reduce the number of passes we have to make over all the PVs. However, this design does not work well with pod preemption. Pod preemption needs to be able to evaluate whether evicting a lower priority Pod will make a higher priority Pod schedulable, and it does this by re-evaluating predicates without the lower priority Pod.

If we had put the MatchUnboundPVCs predicate at the end, then pod preemption wouldn't have an accurate filtered node list, and could end up preempting pods on a Node that the higher priority pod still cannot run on due to its PVC requirements. For that reason, the PVC binding decision needs to have its predicate function separated out and evaluated with the rest of the predicates.

##### Pull entire PVC binding into the scheduler
The proposed design only has the scheduler initiating the binding transaction by prebinding the PV. An alternative is to pull the whole two-way binding transaction into the scheduler, but there are some complex scenarios that the scheduler's Pod sync loop cannot handle:
* PVC and PV getting unexpectedly unbound or lost
* PVC and PV state getting partially updated
* PVC and PV deletion and cleanup

Handling these scenarios in the scheduler's Pod sync loop is not possible, so they have to remain in the PV controller.

##### Keep all PVC binding in the PV controller
Instead of initiating PV binding in the scheduler, have the PV controller wait until the Pod has been scheduled to a Node, and then try to bind based on the chosen Node. A new scheduling predicate is still needed to filter and match the PVs (but not actually bind them).

The advantages are:
* Existing scenarios where the scheduler is bypassed will work the same way.
* Custom schedulers will continue to work the same way.
* Most of the PV logic is still contained in the PV controller, simplifying HA upgrades.

Major downsides of this approach include:
* It requires the PV controller to watch Pods and potentially change its sync loop to operate on pods, in order to handle the multiple-PVCs-in-a-pod scenario. This is a potentially big change that would be hard to keep separate and feature-gated from the current PV logic.
* Both the scheduler and PV controller processes have to make the binding decision, but because they do so asynchronously, it is possible for them to choose different PVs. The scheduler has to cache its decision so that it won't choose the same PV for another PVC. But by the time the PV controller handles that PVC, it could choose a different PV than the scheduler did.
* Recovering from this inconsistent decision and syncing the two caches is very difficult. The scheduler could have made a cascading sequence of decisions based on the first inconsistent decision, and they would all have to somehow be fixed based on the real PVC/PV state.
* If the scheduler process restarts, it loses all its in-memory PV decisions and can make many wrong decisions after the restart.
* All the volume scheduler predicates that require the PVC to be bound will not get evaluated. To solve this, all the volume predicates would also need to be built into the PV controller when matching possible PVs.

##### Move PVC binding to kubelet
Looking into the future, with the potential for NUMA-aware scheduling, you could have a sub-scheduler on each node to handle the pod scheduling within a node. It could make sense to have the volume binding as part of this sub-scheduler, to make sure that the volume selected will have NUMA affinity with the rest of the resources that the pod requested.

However, there are potential security concerns, because kubelet would need to see unbound PVs in order to bind them. For local storage, the PVs could be restricted to just that node, but for zonal storage, it could see all the PVs in that zone.

In addition, the sub-scheduler is just a thought at this point, and there are no concrete proposals in this area yet.

### Binding multiple PVCs in one transaction
TODO (beta): More details

For the alpha phase, this is not required. Since the scheduler is serialized, a partial binding failure should be a rare occurrence. Manual recovery will need to be done for those rare failures.

One approach to handle this is to roll back previously bound PVCs using PV taints and tolerations. The details will be handled as a separate feature.

For rollback, PersistentVolumes will have a new status to indicate whether they are clean or dirty. For backwards compatibility, a nil value is treated as dirty. The PV controller will set the status to clean if the PV is Available and unbound. Kubelet will set the PV status to dirty during Pod admission, before adding the volume to the desired state.

Two new PV taints are available for the scheduler to use:
* ScheduleFailClean, with a short default toleration.
* ScheduleFailDirty, with an infinite default toleration.

If scheduling fails, update all bound PVs and add the appropriate taint depending on whether the PV is clean or dirty, then retry scheduling. It could be that the reason for the scheduling failure clears within the toleration period, and the PVCs are able to be bound and scheduled.

If all PVs are bound and have a ScheduleFail taint, but the toleration has not been exceeded, then remove the ScheduleFail taints.

If all PVs are bound and none have ScheduleFail taints, then continue to schedule the pod.
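
Purely as an illustration of the clean/dirty decision, a helper along these lines could pick the taint key; the `pvStatus` type, the nil-means-dirty default handling, and the function itself are assumptions pending the separate rollback feature design.

```
// Illustrative only: choosing which taint to apply on scheduling failure.
// The taint keys come from this proposal; everything else is a placeholder.
type pvStatus struct {
	Clean *bool // nil is treated as dirty for backwards compatibility
}

func taintKeyForScheduleFailure(status pvStatus) string {
	if status.Clean != nil && *status.Clean {
		return "ScheduleFailClean" // short default toleration
	}
	return "ScheduleFailDirty" // infinite default toleration
}
```
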

### Recovering from kubelet rejection of pod

TODO (beta): More details

Again, for the alpha phase, manual intervention will be required.

For beta, we can use the same rollback mechanism as above to handle this case. If kubelet rejects a pod, it will go back to scheduling. If the scheduler cannot find a node for the pod, then it will encounter a scheduling failure and taint the PVs with the ScheduleFail taint.

### Making dynamic provisioning topology aware
TODO (beta): Design details

For alpha, we are not focusing on this use case, but it should be able to follow the new workflow closely with some modifications:
* The MatchUnboundPVCs predicate function needs to get provisionable capacity per topology dimension from the provisioner somehow.
* The PrioritizeUnboundPVCs priority function can add a new priority score factor based on the available capacity per node.
* The BindUnboundPVCs bind function needs to pass the node to the provisioner. The internal and external provisioning APIs need to be updated to take a node parameter.
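
As a rough illustration of that last point, the provisioning call might grow a node parameter along these lines; the interface and type names are assumptions, since the real API change is part of the beta design.

```
// Illustrative only: a provisioning interface extended with a node parameter.
type TopologyAwareProvisioner interface {
	// Provision creates a volume whose topology is compatible with the given node.
	Provision(claim *PVC, node *Node) (*PV, error)
}
```
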

## Testing

### E2E tests
* StatefulSet, replicas=3, specifying pod anti-affinity
    * Positive: Local PVs on each of the nodes
    * Negative: Local PVs only on 2 out of the 3 nodes
* StatefulSet specifying pod affinity
    * Positive: Multiple local PVs on a node
    * Negative: Only one local PV available per node
* Multiple PVCs specified in a pod
    * Positive: Enough local PVs available on a single node
    * Negative: Not enough local PVs available on a single node
* Fallback to dynamic provisioning if the static PVs are unsuitable
* Existing PV tests need to be modified, if alpha is enabled, to check that the PVC is bound after the pod has started.
* Existing dynamic provisioning tests may need to be disabled for alpha if they don't launch pods, or extended to launch pods.
* Existing prebinding tests should still work.

### Unit tests
* All PVCs found a match on the first node. Verify the match is best suited based on capacity.
* All PVCs found a match on the second node. Verify the match is best suited based on capacity.
* Only 2 out of 3 PVCs have a match.
* Priority scoring doesn't change the given priorityList order.
* Priority scoring changes the priorityList order.
* Don't match PVs that are prebound.


## Implementation Plan

### Alpha
* New feature gate for volume topology scheduling
* Refactor PV controller methods into a common library
* PV controller: Disable binding of unbound PVCs if the feature gate is set
* Predicate: Filter nodes and find matching PVs
* Predicate: Check if a provisioner exists for dynamic provisioning
* Update existing predicates to skip unbound PVCs
* Bind: Trigger PV binding
* Bind: Trigger dynamic provisioning
* Tests: Refactor all the tests that expect a PVC to be bound before scheduling a Pod (only if alpha is enabled)

### Beta
* Scheduler cache: Optimizations for PVs without node affinity
* Priority: Capacity match score
* Bind: Handle partial binding failures
* Plugins: Convert all zonal volume plugins to use the new PV node affinity (GCE PD, AWS EBS, what else?)
* Make dynamic provisioning topology aware

### GA
* Predicate: Handle max PD per node limit
* Scheduler cache: Optimizations for PV node affinity
* Handle kubelet rejection of pod

## Open Issues
* Can the generic device resource API be leveraged at all? Probably not, because:
    * It will only work for local storage (node-specific devices), and not zonal storage.
    * Storage already has its own first-class resources in K8s (PVC/PV) with an independent lifecycle. The current resource API proposal does not have a way to specify identity/persistence for devices.
* Will this be able to work with the node sub-scheduler design for NUMA-aware scheduling?
    * It's still in a very early discussion phase.