5.2 KiB
Delay Pod Creation
@k82cn; Jan 7, 2019
Table of Contents
Created by gh-md-toc
Motivation
For a batch system, there're always several pending jobs because of limited resources and throughput. Different with other kubernetes type, e.g. Deployment, DaemonSet, it's better to delay pods creation for batch workload to reduce apiserver pressure and speed up scheduling (e.g. less pending pods to consider). In this document, several enhancements are introduced to delay pod creation.
Function Detail
State
A new state, named InQueue, will be introduced to denote the phase that jobs are ready to be allocated.
After InQueue, the state transform map is updated as follow.
| From | To | Reason |
|---|---|---|
| Pending | InQueue | When it's ready to allocate resource to job |
| InQueue | Pending | When there's not enough resources anymore |
| InQueue | Running | When every pods of spec.minMember are running |
The InQueue is a new state between Pending and Running; and it'll let operators/controllers start to
create pods. If it meets errors, e.g. unschedulable, it rollbacks to Pending instead of InQueue to
avoid retry-loop.
Action
Currently, kube-batch supports several actions, e.g. allocate, preempt; but all those actions are executed
based on pending pods. To support InQueue state, a new action, named enqueue, will be introduced.
By default, enqueue action will handle PodGroups in FCFS policy; enqueue will go through all PodGroup
(by creation timestamp) and update PodGroup's phase to InQueue if:
- there're enough idle resources for
spec.minResourcesofPodGroup - there're enough quota for
spec.minResourcesofPodGroup
As kube-batch handling PodGroup by spec.minResources, the operator/controller may create more Pods than
spec.minResources; in such case, preempt action will be enhanced to evict overused PodGroup to release
resources.
Admission Webhook
To guarantee the transaction of spec.minResources, a new MutatingAdmissionWebhook, named PodGroupMinResources,
is introduced. PodGroupMinResources make sure
- the summary of all PodGroups'
spec.minResourcesin a namespace not more thanQuota - if resources are reserved by
spec.minResources, the resources can not be used by others
Generally, it's better to let total Quota to be more than available resources in cluster, as some pods maybe
unschedulable because of scheduler's algorithm, e.g. predicates.
Feature interaction
Queue
The resources will be shared between Queues algorithm, e.g. proportion by default. If the resources can not be
fully used because of fragment, backfill action will help on that. If Queue used more resources than its
deserved, reclaim action will help to balance resources. The Pod can not be evicted currently if eviction will
break spec.minMember; it'll be enhanced for job level eviction.
Quota
To delay pod creation, both kube-batch and PodGroupMinResources will watch ResourceQuota to decide which
PodGroup should be in queue firstly. The decision maybe invalid because of race condition, e.g. other
controllers create Pods. In such case, PodGroupMinResources will reject PodGroup creation and keep InQueue
state until kube-batch transform it back to Pending. To avoid race condition, it's better to let kube-batch
manage Pod number and resources (e.g. CPU, memory) instead of Quota.
Operator/Controller
The Operator/Controller should follow the above "protocol" to work together with scheduler. A new component,
named PodGroupController, will be introduced later to enforce this protocol if necessary.
Others
Compatibility
To support this new feature, a new state and a new action are introduced; so when the new enqueue action is
disabled in the configuration, it'll keep the same behaviour as before.
Roadmap
InQueuephase andenqueueaction (v0.5+)- Admission Controller (v0.6+)