# Delay Pod Creation

@k82cn; Jan 7, 2019

## Table of Contents

* [Delay Pod Creation](#delay-pod-creation)
  * [Table of Contents](#table-of-contents)
  * [Motivation](#motivation)
  * [Function Detail](#function-detail)
    * [State](#state)
    * [Action](#action)
    * [Admission Webhook](#admission-webhook)
  * [Feature interaction](#feature-interaction)
    * [Queue](#queue)
    * [Quota](#quota)
    * [Operator/Controller](#operatorcontroller)
  * [Others](#others)
    * [Compatibility](#compatibility)
  * [Roadmap](#roadmap)
  * [Reference](#reference)

Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)
## Motivation

In a batch system, there are usually many pending jobs because resources and throughput are limited.
Unlike other Kubernetes workload types, e.g. Deployment and DaemonSet, it is better to delay pod creation
for batch workloads, to reduce apiserver pressure and speed up scheduling (e.g. fewer pending pods to
consider). This document introduces several enhancements to delay pod creation.
## Function Detail

### State

A new state, named `InQueue`, will be introduced to denote the phase in which a job is ready to have
resources allocated to it. With `InQueue`, the state transition map is updated as follows.

| From    | To      | Reason                                                 |
|---------|---------|--------------------------------------------------------|
| Pending | InQueue | When it is ready to allocate resources to the job      |
| InQueue | Pending | When there are no longer enough resources              |
| InQueue | Running | When all pods required by `spec.minMember` are running |

`InQueue` is a new state between `Pending` and `Running`; it lets operators/controllers start to create
pods. If errors occur, e.g. the pods are unschedulable, the job rolls back to `Pending` instead of
`InQueue` to avoid a retry loop.
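
For illustration only, the sketch below shows how the new phase and the transition table above might look
in Go; the names follow the `PodGroupPhase` style of the existing API but are assumptions of this sketch,
not the final API.

```go
// Illustrative sketch: a PodGroupPhase type extended with the new InQueue phase,
// plus the transitions from the table above. Names are assumed, not final.
package api

type PodGroupPhase string

const (
	PodGroupPending PodGroupPhase = "Pending"
	PodGroupInQueue PodGroupPhase = "InQueue" // new phase introduced by this design
	PodGroupRunning PodGroupPhase = "Running"
)

// validTransitions mirrors the state transition table above.
var validTransitions = map[PodGroupPhase][]PodGroupPhase{
	PodGroupPending: {PodGroupInQueue},                  // ready to allocate resources to the job
	PodGroupInQueue: {PodGroupPending, PodGroupRunning}, // roll back on errors, or run once minMember pods run
}
```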
### Action

Currently, `kube-batch` supports several actions, e.g. `allocate` and `preempt`, but all of those actions
are executed based on pending pods. To support the `InQueue` state, a new action, named `enqueue`, will be
introduced.

By default, the `enqueue` action handles `PodGroup`s with an FCFS policy: it goes through all PodGroups
(ordered by creation timestamp) and updates a PodGroup's phase to `InQueue`, as sketched after this list,
if:

* there are enough idle resources for the PodGroup's `spec.minResources`
* there is enough quota for the PodGroup's `spec.minResources`
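
As a rough illustration of this FCFS loop, the sketch below (continuing the phase constants sketched under
[State](#state)) checks idle resources and remaining quota against `spec.minResources`; the `Resource` and
`PodGroup` types and the helper names are simplified placeholders, not kube-batch's actual session API.

```go
package api

import (
	"sort"
	"time"
)

// Resource is a simplified resource vector (CPU in millicores, memory in bytes).
type Resource struct {
	MilliCPU int64
	Memory   int64
}

func (r *Resource) Covers(o Resource) bool { return r.MilliCPU >= o.MilliCPU && r.Memory >= o.Memory }
func (r *Resource) Sub(o Resource)         { r.MilliCPU -= o.MilliCPU; r.Memory -= o.Memory }

// PodGroup is a simplified view of the fields the enqueue action cares about.
type PodGroup struct {
	Name              string
	CreationTimestamp time.Time
	MinResources      Resource // spec.minResources
	Phase             PodGroupPhase
}

// enqueue walks PodGroups in FCFS order (by creation timestamp) and moves a
// pending PodGroup to InQueue only when both the idle cluster resources and the
// remaining quota can cover its spec.minResources.
func enqueue(groups []*PodGroup, idle, quotaLeft *Resource) {
	sort.Slice(groups, func(i, j int) bool {
		return groups[i].CreationTimestamp.Before(groups[j].CreationTimestamp)
	})

	for _, pg := range groups {
		if pg.Phase != PodGroupPending {
			continue
		}
		if idle.Covers(pg.MinResources) && quotaLeft.Covers(pg.MinResources) {
			pg.Phase = PodGroupInQueue
			idle.Sub(pg.MinResources)
			quotaLeft.Sub(pg.MinResources)
		}
	}
}
```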
Because `kube-batch` handles a `PodGroup` based on its `spec.minResources`, the operator/controller may
create more `Pod`s than `spec.minResources` accounts for; in that case, the `preempt` action will be
enhanced to evict the overused `PodGroup` to release resources.
### Admission Webhook
|
|
|
|
To guarantee the transaction of `spec.minResources`, a new `MutatingAdmissionWebhook`, named `PodGroupMinResources`,
|
|
is introduced. `PodGroupMinResources` make sure
|
|
|
|
* the summary of all PodGroups' `spec.minResources` in a namespace not more than `Quota`
|
|
* if resources are reserved by `spec.minResources`, the resources can not be used by others
|
|
|
|
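
For illustration, a minimal sketch of that check, reusing the simplified `Resource` type from the enqueue
sketch above; the function name and its inputs are assumptions of this sketch, not the webhook's actual
implementation.

```go
// Sketch of the PodGroupMinResources admission check: admit a new PodGroup only
// if the sum of spec.minResources across the namespace, including the new
// PodGroup, still fits within the namespace quota. Names are illustrative.
func admitPodGroup(newMin Resource, existingMins []Resource, quota Resource) bool {
	var reserved Resource
	for _, m := range existingMins {
		reserved.MilliCPU += m.MilliCPU
		reserved.Memory += m.Memory
	}
	reserved.MilliCPU += newMin.MilliCPU
	reserved.Memory += newMin.Memory

	// Reject when the reservation would exceed the quota; resources reserved by
	// admitted PodGroups are then unavailable to other consumers.
	return reserved.MilliCPU <= quota.MilliCPU && reserved.Memory <= quota.Memory
}
```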
Generally, it is better to let the total `Quota` be larger than the available resources in the cluster, as
some pods may be unschedulable because of the scheduler's algorithm, e.g. predicates.
## Feature interaction

### Queue

Resources are shared between `Queue`s by an algorithm, e.g. `proportion` by default. If resources can not
be fully used because of fragmentation, the `backfill` action will help with that. If a `Queue` uses more
resources than it deserves, the `reclaim` action will help to rebalance them. Currently, a Pod can not be
evicted if the eviction would break `spec.minMember`; this will be enhanced to support job-level eviction.
### Quota

To delay pod creation, both `kube-batch` and `PodGroupMinResources` watch `ResourceQuota` to decide which
`PodGroup` should be enqueued first. The decision may be invalidated by race conditions, e.g. other
controllers creating Pods. In such a case, `PodGroupMinResources` will reject the `PodGroup` creation and
keep the `InQueue` state until `kube-batch` transforms it back to `Pending`. To avoid such races, it is
better to let `kube-batch` manage the number of `Pod`s and their resources (e.g. CPU, memory) instead of
`Quota`.
### Operator/Controller

The operator/controller should follow the "protocol" above to work together with the scheduler. A new
component, named `PodGroupController`, will be introduced later to enforce this protocol if necessary.
## Others

### Compatibility

To support this new feature, a new state and a new action are introduced. When the new `enqueue` action is
disabled in the configuration, the scheduler keeps the same behaviour as before.
## Roadmap

* `InQueue` phase and `enqueue` action (v0.5+)
* Admission Controller (v0.6+)
## Reference

* [Coscheduling](https://github.com/kubernetes/enhancements/pull/639)
* [Delay Pod creation](https://github.com/kubernetes-sigs/kube-batch/issues/539)
* [PodGroup Status](https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/podgroup-status.md)
* [Support 'spec.TotalResources' in PodGroup](https://github.com/kubernetes-sigs/kube-batch/issues/401)
* [Dynamic Admission Control](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#write-an-admission-webhook-server)
* [Add support for podGroup number limits for one queue](https://github.com/kubernetes-sigs/kube-batch/issues/452)