Merge pull request #1714 from k82cn/k8s_548
Design doc of 'Schedule DS Pod by default scheduler'.
# Schedule DaemonSet Pods by default scheduler, not DaemonSet controller
[@k82cn](http://github.com/k82cn), Feb 2018, [#42002](https://github.com/kubernetes/kubernetes/issues/42002).
## Motivation
A DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them. As nodes are removed from the cluster, those pods are garbage collected. Normally, the node that a pod runs on is selected by the Kubernetes scheduler; however, DaemonSet pods are created and scheduled by the DaemonSet controller, which reuses the kube-scheduler's predicates. That introduces the following issues:
* DaemonSet cannot react to changes in a Node's resources, e.g. resources freed after other Pods exit ([#46935](https://github.com/kubernetes/kubernetes/issues/46935), [#58868](https://github.com/kubernetes/kubernetes/issues/58868))
* DaemonSet cannot respect Pod Affinity and Pod AntiAffinity ([#29276](https://github.com/kubernetes/kubernetes/issues/29276))
* Logic is duplicated to keep up with scheduler features, e.g. critical pods ([#42028](https://github.com/kubernetes/kubernetes/issues/42028)), taints/tolerations
* It is hard to debug why a DaemonSet Pod is not created, e.g. because of insufficient resources; it would be better to have a pending Pod with the failed predicates recorded as events
* It is hard to support preemption consistently across different components, e.g. the DaemonSet controller and the default scheduler
After [discussions](https://docs.google.com/document/d/1v7hsusMaeImQrOagktQb40ePbK6Jxp1hzgFB9OZa_ew/edit#), SIG Scheduling approved changing the DaemonSet controller to create DaemonSet Pods with node affinity set, and to let them be scheduled by the default scheduler. After this change, the DaemonSet controller will no longer schedule DaemonSet Pods directly.
## Solutions
Before discussing the solutions/options, there are some requirements and questions regarding DaemonSet:
* **Q**: The DaemonSet controller can create pods even when a node's network is unavailable, e.g. for CNI network providers (Calico, Flannel).
  Will this impact bootstrapping, such as when a DaemonSet is being used to provide the pod network?
**A**: This will be handled by supporting the scheduling of workloads that tolerate NotReady Nodes ([#45717](https://github.com/kubernetes/kubernetes/issues/45717)); after moving to taint-based node checks, DaemonSet pods will tolerate the `NetworkUnavailable` taint (a toleration sketch follows this Q&A list).
* **Q**: The DaemonSet controller can create pods even when the scheduler has not been started, which can help cluster bootstrap.
**A**: As the scheduling logic moves to the default scheduler, kube-scheduler must be started during cluster start-up.
* **Q**: Will this change/constrain update strategies, such as scheduling an updated pod to a node before the previous pod is gone?
**A**: No, this will NOT change update strategies.
* **Q**: How would Daemons be integrated into the Node lifecycle, such as being scheduled on a node before any other pods and/or remaining after all others are evicted? This isn't currently implemented, but was planned.
**A**: Similar to other Pods; DaemonSet Pods only carry attributes that ensure one Pod per Node, and the DaemonSet controller will create Pods based on the number of nodes (taking `nodeSelector` into account).
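The first Q&A above mentions tolerating the `NetworkUnavailable` taint. Below is a minimal sketch of the toleration a DaemonSet pod template might carry, assuming the `node.kubernetes.io/network-unavailable` taint key applied by the node controller when a node's network is not ready:

```yaml
# Sketch only: the taint key is an assumption about the node controller's
# network-unavailable taint; adjust to whatever key the cluster actually uses.
tolerations:
- key: node.kubernetes.io/network-unavailable
  operator: Exists
  effect: NoSchedule
```

With such a toleration, a network-provider DaemonSet can still land on nodes that are not yet Ready because the pod network is missing.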
Currently, DaemonSet pods are created and scheduled by the DaemonSet controller as follows:
1. The DS controller filters nodes by `nodeSelector` and the scheduler's predicates
2. For each selected node, it creates a Pod bound to that node by setting `spec.nodeName` directly, which skips the default scheduler
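For illustration, a sketch of such a directly bound Pod; the pod name, labels, node name, and image are placeholders:

```yaml
# Sketch of a Pod as created by the current DaemonSet controller (placeholder names/image).
apiVersion: v1
kind: Pod
metadata:
  name: example-daemon-pod-abcde
  labels:
    app: example-daemon
spec:
  nodeName: node-1            # bound directly by the DaemonSet controller; never queued in kube-scheduler
  containers:
  - name: daemon
    image: example/daemon:latest
```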
The proposed option is to leverage the NodeAffinity feature so that the scheduler's predicates do not have to be duplicated in the DS controller:
1. The DS controller filters nodes by `nodeSelector`, but does NOT check the scheduler's predicates (e.g. `PodFitsResources`)
2. For each node, the DS controller creates a Pod with the NodeAffinity shown below
3. When syncing Pods, the DS controller maps nodes to Pods by this NodeAffinity to check whether a Pod has been started for each node
4. In the scheduler, DaemonSet Pods will stay pending if scheduling predicates fail. To avoid this, an appropriate priority must be set on all critical DaemonSet Pods; the scheduler will preempt other pods to ensure critical pods are scheduled even when the cluster is under resource pressure (a priority sketch follows the affinity example below).
```yaml
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - dest_hostname
```
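For step 4, a critical DaemonSet would carry a priority so the scheduler can preempt lower-priority pods when needed. A minimal sketch, assuming the built-in `system-node-critical` priority class; the DaemonSet name, labels, and image are placeholders:

```yaml
# Sketch only: names and image are placeholders, not part of this proposal.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: critical-daemon
spec:
  selector:
    matchLabels:
      app: critical-daemon
  template:
    metadata:
      labels:
        app: critical-daemon
    spec:
      priorityClassName: system-node-critical   # lets the scheduler preempt lower-priority pods
      containers:
      - name: daemon
        image: example/daemon:latest
```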
## Reference
* [DaemonsetController can't feel it when node has more resources, e.g. other Pod exits](https://github.com/kubernetes/kubernetes/issues/46935)
* [DaemonsetController can't feel it when node recovered from outofdisk state](https://github.com/kubernetes/kubernetes/issues/45628)
* [DaemonSet pods should be scheduled by default scheduler, not DaemonSet controller](https://github.com/kubernetes/kubernetes/issues/42002)
* [NodeController should add NoSchedule taints and we should get rid of getNodeConditionPredicate()](https://github.com/kubernetes/kubernetes/issues/42001)
* [DaemonSet should respect Pod Affinity and Pod AntiAffinity](https://github.com/kubernetes/kubernetes/issues/29276)
* [Make DaemonSet respect critical pods annotation when scheduling](https://github.com/kubernetes/kubernetes/pull/42028)