Status: Draft

Created: 2018-04-09 / Last updated: 2018-08-15

Author: bsalamat

Contributors: misterikkit

---

- [SUMMARY ](#summary-)
- [OBJECTIVE](#objective)
  - [Terminology](#terminology)
- [BACKGROUND](#background)
- [OVERVIEW](#overview)
  - [Non-goals](#non-goals)
- [DETAILED DESIGN](#detailed-design)
  - [Bare bones of scheduling](#bare-bones-of-scheduling)
  - [Communication and statefulness of plugins](#communication-and-statefulness-of-plugins)
  - [Plugin registration](#plugin-registration)
  - [Extension points](#extension-points)
    - [Scheduling queue sort](#scheduling-queue-sort)
    - [Pre-filter](#pre-filter)
    - [Filter](#filter)
    - [Post-filter](#post-filter)
    - [Scoring](#scoring)
    - [Post-scoring/pre-reservation](#post-scoringpre-reservation)
    - [Reserve](#reserve)
    - [Permit](#permit)
      - [Approving a Pod binding](#approving-a-pod-binding)
    - [Reject](#reject)
    - [Pre-Bind](#pre-bind)
    - [Bind](#bind)
    - [Post Bind](#post-bind)
- [USE-CASES](#use-cases)
  - [Dynamic binding of cluster-level resources](#dynamic-binding-of-cluster-level-resources)
  - [Gang Scheduling](#gang-scheduling)
- [OUT OF PROCESS PLUGINS](#out-of-process-plugins)
- [CONFIGURING THE SCHEDULING FRAMEWORK](#configuring-the-scheduling-framework)
- [BACKWARD COMPATIBILITY WITH SCHEDULER v1](#backward-compatibility-with-scheduler-v1)
- [DEVELOPMENT PLAN](#development-plan)
- [TESTING PLAN](#testing-plan)
- [WORK ESTIMATES ](#work-estimates)

# SUMMARY

This document describes the Kubernetes Scheduling Framework. The scheduling
framework implements only basic scheduling functionality, but exposes many
extension points through which plugins can extend its behavior. The plan is
that this framework (with its plugins) will eventually replace the current
Kubernetes scheduler.

# OBJECTIVE

- Make the scheduler more extensible.
- Make the scheduler core simpler by moving some of its features to plugins.
- Propose extension points in the framework.
- Propose a mechanism to receive plugin results and continue or abort based
  on the received results.
- Propose a mechanism to handle errors and communicate them to plugins.

## Terminology

Scheduler v1, current scheduler: refers to the existing Kubernetes scheduler.
Scheduler v2, scheduling framework: refers to the new scheduler proposed in
this doc.

# BACKGROUND

Many features are being added to the Kubernetes default scheduler. They keep
making the code larger and the logic more complex. A more complex scheduler is
harder to maintain, its bugs are harder to find and fix, and users running a
custom scheduler have a hard time catching up with and integrating new changes.
The current Kubernetes scheduler provides
[webhooks to extend](./scheduler_extender.md)
its functionality. However, these are limited in a few ways:

1. The number of extension points is limited: "Filter" extenders are called
   after the default predicate functions. "Prioritize" extenders are called
   after the default priority functions. "Preempt" extenders are called after
   running the default preemption mechanism. The "Bind" verb of the extenders
   is used to bind a Pod. Only one of the extenders can be a binding extender,
   and that extender performs binding instead of the scheduler. Extenders
   cannot be invoked at other points; for example, they cannot be called
   before running the predicate functions.
1. Every call to an extender involves marshalling and unmarshalling JSON.
   Calling a webhook (HTTP request) is also slower than calling native functions.
1. It is hard to inform an extender that the scheduler has aborted scheduling
   of a Pod. For example, suppose an extender provisions a cluster resource,
   and the scheduler asks it to provision an instance of that resource for the
   Pod being scheduled. If the scheduler then runs into errors and decides to
   abort scheduling the Pod, it is hard to communicate the error to the
   extender and ask it to undo the provisioning of the resource.
1. Since current extenders run as a separate process, they cannot use the
   scheduler's cache. They must either build their own cache from the API
   server or process only the information they receive from the default scheduler.

The above limitations hinder building high-performance and versatile scheduler
extensions. We would ideally like an extension mechanism that is fast enough to
allow keeping a bare minimum of logic in the scheduler core and converting many
of the existing features of the default scheduler, such as predicate and
priority functions and preemption, into plugins. Such plugins will be compiled
with the scheduler. We would also like to provide an extension mechanism that
does not require recompiling the scheduler. The expected performance of such
plugins is lower than that of in-process plugins, so out-of-process plugins
should be used in cases where quick invocation of the plugin is not a
constraint.

# OVERVIEW

Scheduler v2 allows both built-in and out-of-process extenders. This new
architecture is a scheduling framework that exposes several extension points
during a scheduling cycle. Scheduler plugins can register to run at one or more
extension points.

#### Non-goals

- We will keep Kubernetes API backward compatibility, but keeping scheduler
  v1 backward compatibility is a non-goal. In particular, scheduling policy
  config and v1 extenders won't work in this new framework.
- Solve all the scheduler v1 limitations, although we would like to ensure
  that the new framework allows us to address known limitations in the future.
- Provide implementation details of plugins and call-back functions, such as
  all of their arguments and return values.

# DETAILED DESIGN

## Bare bones of scheduling

Pods that are not assigned to any node go to a scheduling queue and are sorted
in an order specified by plugins (described [here](#scheduling-queue-sort)).
The scheduling framework picks the head of the queue and starts a **scheduling
cycle** to schedule the pod. At the end of the cycle the scheduler determines
whether the pod is schedulable or not. If the pod is not schedulable, its
status is updated and it goes back to the scheduling queue. If the pod is
schedulable (one or more nodes are found that can run the Pod), the scoring
process is started. The scoring process finds the best node to run the Pod.
Once the best node is picked, the scheduler updates its cache and then a bind
goroutine is started to bind the pod.
The above process is the same as what Kubernetes scheduler v1 does. Some of the
essential features of scheduler v1, such as leader election, will also be
transferred to the scheduling framework.
In the rest of this section we describe how various plugins are used to enrich
this basic workflow. This document focuses on in-process plugins.
Out-of-process plugins are discussed later in a separate doc.

## Communication and statefulness of plugins

The scheduling framework provides a library that plugins can use to pass
information to other plugins. This library keeps a map from keys of type string
to opaque pointers of type interface{}. A write operation takes a key and a
pointer and stores the opaque pointer in the map with the given key. Other
plugins can provide the key and receive the opaque pointer. Multiple plugins
can share state or communicate via this mechanism.
The saved state is preserved only during a single scheduling cycle. At the end
of a scheduling cycle, this map is destroyed. So, plugins cannot keep shared
state across multiple scheduling cycles. They can, however, update the
scheduler cache via the provided cache interface. The cache interface allows
limited state preservation across multiple scheduling cycles.
It is worth noting that plugins are assumed to be **trusted**. The scheduler
does not prevent one plugin from accessing or modifying another plugin's state.

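As an illustration only, such a per-cycle state library could look like the
sketch below. The `CycleState` name and the `Write`/`Read` methods are
assumptions, not a final API; the mutex is included only because some plugins
(for example, Filter plugins) may run in parallel for multiple nodes.

```go
package framework

import "sync"

// CycleState holds opaque, plugin-owned values for a single scheduling cycle.
// It is discarded when the cycle ends, so nothing stored here survives across
// scheduling cycles.
type CycleState struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

// NewCycleState returns an empty state store for one scheduling cycle.
func NewCycleState() *CycleState {
	return &CycleState{data: make(map[string]interface{})}
}

// Write stores an opaque pointer under the given key.
func (c *CycleState) Write(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// Read returns the value stored under the given key, if any.
func (c *CycleState) Read(key string) (interface{}, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	value, ok := c.data[key]
	return value, ok
}
```
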
## Plugin registration

Plugin registration is done by providing an extension point and a function that
should be called at that extension point. This step will be something like:

```go
register("pre-filter", plugin.foo)
```

The details of the function signature will be provided later.

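One way such a registry could be organized is sketched below. The
`ExtensionPoint` and `Registry` types are purely illustrative; plugin functions
are kept as `interface{}` here because their signatures are not yet specified.

```go
package framework

// ExtensionPoint names a point in the scheduling cycle where plugins can run.
type ExtensionPoint string

const (
	QueueSort  ExtensionPoint = "queue-sort"
	PreFilter  ExtensionPoint = "pre-filter"
	Filter     ExtensionPoint = "filter"
	PostFilter ExtensionPoint = "post-filter"
	Score      ExtensionPoint = "score"
	// ... one constant per extension point described below.
)

// Registry keeps, for each extension point, the ordered list of registered
// plugin functions. Plugins run in the order they were registered.
type Registry map[ExtensionPoint][]interface{}

// Register appends a plugin function to the given extension point.
func (r Registry) Register(point ExtensionPoint, plugin interface{}) {
	r[point] = append(r[point], plugin)
}
```
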
## Extension points

The following picture shows the scheduling cycle of a Pod and the extension
points that the scheduling framework exposes. In this picture "Filter" is
equivalent to "Predicate" in scheduler v1 and "Scoring" is equivalent to
"Priority function". Plugins are Go functions. They are registered to be called
at one of these extension points. They are called by the framework in the same
order they are registered for each extension point.
In the following sections we describe each extension point in the same order
they are called in a scheduling cycle.

*(figure: extension points of the scheduling cycle)*

### Scheduling queue sort

These plugins indicate how Pods should be sorted in the scheduling queue. A
plugin registered at this point only returns "greater", "smaller", or "equal"
to indicate an ordering between two Pods. In other words, a plugin at this
extension point returns the answer to "less(pod1, pod2)". Multiple plugins may
be registered at this point. Plugins registered at this point are called in
order and the invocation continues as long as plugins return "equal". Once a
plugin returns "greater" or "smaller", the invocation of these plugins is
stopped.

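A sketch of how the framework might compose these plugins into a single "less"
function follows; the `QueueSortPlugin` type and the -1/0/+1 convention are
assumptions used only for illustration.

```go
package queuesort

import v1 "k8s.io/api/core/v1"

// QueueSortPlugin compares two Pods and returns -1 ("smaller"), 0 ("equal"),
// or +1 ("greater").
type QueueSortPlugin func(p1, p2 *v1.Pod) int

// Less consults the registered plugins in order and stops at the first one
// that returns a non-"equal" answer, as described above.
func Less(plugins []QueueSortPlugin, p1, p2 *v1.Pod) bool {
	for _, compare := range plugins {
		switch compare(p1, p2) {
		case -1:
			return true // p1 should be scheduled before p2.
		case 1:
			return false // p2 should be scheduled before p1.
		}
		// "equal": fall through to the next plugin.
	}
	return false // All plugins consider the two Pods equal.
}
```
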
### Pre-filter

These plugins are generally useful to check certain conditions that the cluster
or the Pod must meet. They are also useful to perform pre-processing on the pod
and store some information about the pod that can be used by other plugins.
The pod pointer is passed as an argument to these plugins. If any of these
plugins returns an error, the scheduling cycle is aborted.
These plugins are called serially in the same order registered.

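A hypothetical pre-filter plugin might look like the sketch below: it validates
the Pod and stores a small summary for later plugins through the per-cycle
state store sketched earlier. The state key and the `podSummary` type are made
up for this example.

```go
package prefilter

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// stateWriter is the subset of the per-cycle state store this plugin needs.
type stateWriter interface {
	Write(key string, value interface{})
}

// stateKey is a hypothetical, plugin-owned key.
const stateKey = "example.com/pod-summary"

type podSummary struct {
	containers int
}

// PreFilter checks a condition on the Pod and stores pre-computed data for
// later plugins. Returning an error aborts the scheduling cycle.
func PreFilter(state stateWriter, pod *v1.Pod) error {
	if len(pod.Spec.Containers) == 0 {
		return fmt.Errorf("pod %s/%s has no containers", pod.Namespace, pod.Name)
	}
	state.Write(stateKey, &podSummary{containers: len(pod.Spec.Containers)})
	return nil
}
```
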
### Filter

Filter plugins filter out nodes that cannot run the Pod. The scheduler runs
these plugins per node in the same order that they are registered, but it may
run these filter functions for multiple nodes in parallel. So, these plugins
must use synchronization when they modify state.
The scheduler stops running the remaining filter functions for a node once one
of these filters fails for the node.

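The per-node evaluation could look roughly like this sketch; the `FilterPlugin`
signature is an assumption.

```go
package framework

import v1 "k8s.io/api/core/v1"

// FilterPlugin reports whether the Pod can run on the given Node.
type FilterPlugin func(pod *v1.Pod, node *v1.Node) (fits bool, err error)

// podFitsNode runs the filter plugins for one node in registration order and
// stops at the first failure, as described above.
func podFitsNode(plugins []FilterPlugin, pod *v1.Pod, node *v1.Node) (bool, error) {
	for _, filter := range plugins {
		fits, err := filter(pod, node)
		if err != nil {
			return false, err
		}
		if !fits {
			return false, nil // Remaining filters are skipped for this node.
		}
	}
	return true, nil
}
```
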
### Post-filter

The Pod and the set of nodes that can run the Pod are passed to these plugins.
They are called whether the Pod is schedulable or not (whether the set of nodes
is empty or non-empty).
If any of these plugins returns an error or if the Pod is determined to be
unschedulable, the scheduling cycle is aborted.
These plugins are called serially.

### Scoring

These plugins are similar to priority functions in scheduler v1. They are
utilized to rank nodes that have passed the filtering stage. Similar to Filter
plugins, these are called per node serially in the same order registered, but
the scheduler may run them for multiple nodes in parallel.
Each of these functions returns a score for the given node. The score is
multiplied by the weight of the function and aggregated with the results of
other scoring functions to yield a total score for the node.
These functions can never block scheduling. In case of an error they should
return zero for the Node being ranked.

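Weighted aggregation for a single node might look like this sketch; the
`ScorePlugin` struct and integer scores are assumptions.

```go
package framework

import v1 "k8s.io/api/core/v1"

// ScorePlugin ranks a node for a Pod; Weight scales its contribution.
type ScorePlugin struct {
	Name   string
	Weight int
	Score  func(pod *v1.Pod, node *v1.Node) (int, error)
}

// nodeScore aggregates the weighted scores of all plugins for one node.
func nodeScore(plugins []ScorePlugin, pod *v1.Pod, node *v1.Node) int {
	total := 0
	for _, plugin := range plugins {
		score, err := plugin.Score(pod, node)
		if err != nil {
			score = 0 // Scoring must never block scheduling; errors count as zero.
		}
		total += score * plugin.Weight
	}
	return total
}
```
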
### Post-scoring/pre-reservation

After all scoring plugins are invoked and the scores of the nodes are
determined, the framework picks the node with the highest score and then calls
the post-scoring plugins. The Pod and the chosen Node are passed to these
plugins. These plugins have one more chance to check any conditions about the
assignment of the Pod to this Node and reject the node if needed.

### Reserve

At this point the scheduler updates its cache by "reserving" a Node (partially
or fully) for the Pod. In scheduler v1 this stage is called "assume".
At this point, only the scheduler cache is updated to reflect that the Node is
(partially) reserved for the Pod. The scheduling framework calls plugins
registered at this extension point so that they get a chance to perform cache
updates or other accounting activities. These plugins do not return any value
(except errors).

The actual assignment of the Node to the Pod happens during the "Bind" phase.
That is when the API server updates the Pod object with the Node information.

### Permit

Permit plugins run in a separate goroutine (in parallel). Each plugin can
return one of three possible values: 1) "permit", 2) "deny", or 3) "wait". If
all plugins registered at this extension point return "permit", the pod is sent
to the next step for binding. If any of the plugins returns "deny", the pod is
rejected and sent back to the scheduling queue. If any of the plugins returns
"wait", the Pod is kept in the reserved state until it is explicitly approved
for binding. A plugin that returns "wait" must return a "timeout" as well. If
the timeout expires, the pod is rejected and goes back to the scheduling queue.

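The way the framework combines permit results could resemble the sketch below.
The `PermitResult` type is an assumption, and taking the longest requested
timeout is just one plausible way to handle multiple "wait" verdicts.

```go
package framework

import "time"

// PermitResult is what a single Permit plugin returns.
type PermitResult struct {
	Verdict string        // "permit", "deny", or "wait"
	Timeout time.Duration // required when Verdict is "wait"
}

// combinePermits folds the individual results into one decision: any "deny"
// rejects the Pod immediately; otherwise any "wait" keeps the Pod reserved
// until it is approved or the timeout expires.
func combinePermits(results []PermitResult) (verdict string, wait time.Duration) {
	verdict = "permit"
	for _, r := range results {
		switch r.Verdict {
		case "deny":
			return "deny", 0 // The Pod goes back to the scheduling queue.
		case "wait":
			verdict = "wait"
			if r.Timeout > wait {
				wait = r.Timeout
			}
		}
	}
	return verdict, wait
}
```
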
#### Approving a Pod binding

While any plugin can receive the list of reserved Pods from the cache and
approve them, we expect only the "Permit" plugins to approve binding of
reserved Pods that are in the "waiting" state. Once a Pod is approved, it is
sent to the Bind stage.

### Reject

Plugins called at "Permit" may perform some operations that should be undone if
the Pod reservation fails. The "Reject" extension point allows such clean-up
operations to happen. Plugins registered at this point are called if the
reservation of the Pod is cancelled. The reservation is cancelled if any of the
"Permit" plugins returns "deny" or if a Pod reservation, which is in the "wait"
state, times out.

### Pre-Bind

When a Pod is approved for binding it reaches this stage. These plugins run
before the actual binding of the Pod to a Node happens. The binding starts only
if all of these plugins return true. If any returns false, the Pod is rejected
and sent back to the scheduling queue. These plugins run in a separate
goroutine. The same goroutine runs "Bind" after these plugins when all of them
return true.

### Bind

Once all pre-bind plugins return true, the Bind plugins are executed. Multiple
plugins may be registered at this extension point. Each plugin may return true
or false (or an error). If a plugin returns false, the next plugin is called
until a plugin returns true. Once true is returned, **the remaining plugins are
skipped**. If any of the plugins returns an error or all of them return false,
the Pod is rejected and sent back to the scheduling queue.

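The bind loop described above might be implemented along these lines; the
`BindPlugin` signature is an assumption.

```go
package framework

import v1 "k8s.io/api/core/v1"

// BindPlugin attempts to bind the Pod to the named node. It returns true if
// it handled the binding, false if it declined to.
type BindPlugin func(pod *v1.Pod, nodeName string) (handled bool, err error)

// runBindPlugins tries each bind plugin in order until one handles the Pod.
func runBindPlugins(plugins []BindPlugin, pod *v1.Pod, nodeName string) (bool, error) {
	for _, bind := range plugins {
		handled, err := bind(pod, nodeName)
		if err != nil {
			return false, err // The Pod is rejected and re-queued.
		}
		if handled {
			return true, nil // Remaining bind plugins are skipped.
		}
	}
	return false, nil // No plugin bound the Pod; it is rejected and re-queued.
}
```
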
### Post Bind

The Post Bind plugins can be useful for housekeeping after a pod is scheduled.
These plugins do not return any value and are not expected to influence the
scheduling decision made in the scheduling cycle.

### Informer Events

The scheduling framework, similar to Scheduler v1, will have informers that let
the framework keep its copy of the state of the cluster up-to-date. The
informers generate events, such as "PodAdd", "PodUpdate", "PodDelete", etc. The
framework allows plugins to register their own handlers for any of these
events. The handlers allow plugins with internal state or caches to keep their
state updated.

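A sketch of how a plugin might subscribe to these events follows; the
`EventHandlers` struct and `RegisterEventHandlers` function are assumptions
about an API that is not yet specified.

```go
package framework

import v1 "k8s.io/api/core/v1"

// EventHandlers bundles the callbacks a plugin wants to receive. Any field
// may be left nil.
type EventHandlers struct {
	OnPodAdd    func(pod *v1.Pod)
	OnPodUpdate func(oldPod, newPod *v1.Pod)
	OnPodDelete func(pod *v1.Pod)
}

// handlers collects registrations; the framework would wire these callbacks
// to its shared informers.
var handlers = map[string]EventHandlers{}

// RegisterEventHandlers lets a plugin keep its internal state or cache
// up-to-date by reacting to cluster events.
func RegisterEventHandlers(pluginName string, h EventHandlers) {
	handlers[pluginName] = h
}
```
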
# USE-CASES

In this section we provide a couple of examples of how the scheduling framework
can be used to solve common scheduling scenarios.

### Dynamic binding of cluster-level resources

Cluster-level resources are resources that are not immediately available on
nodes at the time of scheduling Pods. The scheduler needs to ensure that such
cluster-level resources are bound to a chosen Node before it can schedule a Pod
that requires them to that Node. We refer to this type of binding of resources
to Nodes at Pod scheduling time as dynamic resource binding.
Dynamic resource binding has proven to be a challenge in Scheduler v1, because
Scheduler v1 is not flexible enough to support various types of plugins at
different phases of scheduling. As a result, binding of storage volumes is
integrated into the scheduler code, and some non-trivial changes were made to
the scheduler extender to support dynamic binding of network GPUs.
The scheduling framework allows such dynamic bindings in a cleaner way. The
main thread of the scheduling framework processes a pending Pod that requests a
network resource, finds a node for the Pod, and reserves the Pod. A dynamic
resource binder plugin installed at the "Pre-Bind" stage is then invoked (in a
separate thread). It analyzes the Pod and, when it detects that the Pod needs
dynamic binding of the resource, tries to attach the cluster resource to the
chosen node and then returns true so that the Pod can be bound. If the resource
attachment fails, it returns false and the Pod will be retried.
When there are multiple such network resources, each of them installs one
"pre-bind" plugin. Each plugin looks at the Pod and, if the Pod is not
requesting the resource it is interested in, simply returns "true" for the pod.

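Such a plugin could be structured as in the sketch below. The annotation name
and the helper function are hypothetical; a real implementation would talk to
whatever API manages the cluster-level resource.

```go
package netbinder

import v1 "k8s.io/api/core/v1"

// needsNetworkResourceAnnotation is a hypothetical marker on Pods that
// request the cluster-level network resource this plugin manages.
const needsNetworkResourceAnnotation = "example.com/needs-network-resource"

// attachNetworkResource is a stub for the call that binds the cluster-level
// resource to the chosen node.
func attachNetworkResource(pod *v1.Pod, nodeName string) error { return nil }

// PreBind returns true when the Pod may proceed to binding and false when the
// resource attachment failed, in which case the Pod is retried.
func PreBind(pod *v1.Pod, nodeName string) bool {
	if _, ok := pod.Annotations[needsNetworkResourceAnnotation]; !ok {
		return true // Not our resource; let the Pod proceed to binding.
	}
	if err := attachNetworkResource(pod, nodeName); err != nil {
		return false // Attachment failed; the Pod goes back to the queue.
	}
	return true
}
```
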
### Gang Scheduling

Gang scheduling allows a certain number of Pods to be scheduled simultaneously.
If all the members of the gang cannot be scheduled at the same time, none of
them should be scheduled. Gang scheduling may have various other features as
well, but in this context we are interested in simultaneous scheduling of Pods.
Gang scheduling in the scheduling framework can be done with a "Permit" plugin.
The main scheduling thread processes pods one by one and reserves nodes for
them. The gang scheduling plugin at the Permit stage is invoked for each pod.
When it finds that the pod belongs to a gang, it checks the properties of the
gang. If not enough members of the gang are scheduled or in the "wait" state,
the plugin returns "wait". When the number reaches the desired value, all the
Pods in the wait state are approved and sent for binding.

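A gang-scheduling Permit plugin could be sketched as follows. The gang label,
the in-memory counters, and the verdict strings mirror the description above;
all names are illustrative.

```go
package gang

import v1 "k8s.io/api/core/v1"

// gangLabel is a hypothetical label that names the gang a Pod belongs to.
const gangLabel = "example.com/gang"

// Plugin tracks, per gang, how many members have reached the Permit stage and
// how many are required before any of them may bind.
type Plugin struct {
	desired map[string]int // required gang size, keyed by gang name
	arrived map[string]int // members currently scheduled or in "wait" state
}

// Permit returns "permit" for Pods outside any gang, "wait" while the gang is
// incomplete, and "permit" once the desired number of members has arrived (at
// which point the framework would also approve the waiting members).
func (p *Plugin) Permit(pod *v1.Pod) string {
	gang, ok := pod.Labels[gangLabel]
	if !ok {
		return "permit"
	}
	p.arrived[gang]++
	if p.arrived[gang] < p.desired[gang] {
		return "wait" // Keep the Pod reserved until the rest of the gang arrives.
	}
	return "permit"
}
```
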
# OUT OF PROCESS PLUGINS

Out-of-process plugins (OOPPs) are called via JSON over an HTTP interface. In
other words, the scheduler will support webhooks at most (maybe all) of the
extension points. Data sent to an OOPP must be marshalled to JSON and data
received must be unmarshalled. So, calling an OOPP is significantly slower than
calling an in-process plugin.
We do not plan to build OOPPs in the first version of the scheduling framework.
So, more details on them are to be determined.

# DEVELOPMENT PLAN

Earlier, we wanted to develop the scheduling framework as a project independent
of Scheduler V1. However, that would require significant engineering resources.
It would also be more difficult to roll out a new and not fully
backward-compatible scheduler in Kubernetes, where tens of thousands of users
depend on the behavior of the scheduler.
After revisiting the ideas and challenges, we changed our plan and have decided
to build some of the ideas of the scheduling framework into Scheduler V1 to
make it more extensible.

As the first step, we would like to build:
1. [Pre-bind](#pre-bind) and [Reserve](#reserve) plugin points. These will
   help us move our existing cluster resource binding code, such as persistent
   volume binding, to plugins.
1. We will also build
   [the plugin communication mechanism](#communication-and-statefulness-of-plugins).
   This will allow us to build more sophisticated plugins that require
   communication, and it will also help us clean up the existing scheduler's
   code by removing existing transient cache data.

More features of the framework can be added to the scheduler in the future
based on the requirements.

<s>

# CONFIGURING THE SCHEDULING FRAMEWORK

TBD

# BACKWARD COMPATIBILITY WITH SCHEDULER v1

We will build a new set of plugins for scheduler v2 to ensure that the existing
behavior of scheduler v1 in placing Pods on nodes is preserved. This includes
building plugins that replicate the default predicate and priority functions of
scheduler v1 and its binding mechanism, but scheduler extenders built for
scheduler v1 won't be compatible with scheduler v2. Also, predicate and
priority functions which are not enabled by default (such as service affinity)
are not guaranteed to exist in scheduler v2.

# DEVELOPMENT PLAN

We will develop the scheduling framework as an incubator project in SIG
Scheduling. It will be built in a separate code-base independently from
scheduler v1, but we will probably reuse a lot of code from scheduler v1.

# TESTING PLAN

We will add unit tests as we build the functionality of the scheduling
framework. The scheduling framework should eventually be able to pass the
integration and e2e tests of scheduler v1, excluding those tests that involve
scheduler extensions. The e2e and integration tests may need to be modified
slightly, as the initialization and configuration of the scheduling framework
will be different from scheduler v1.

# WORK ESTIMATES

We expect to see an early version of the scheduling framework in two release
cycles (end of 2018). If things go well, we will start offering it as an
alternative to scheduler v1 by the end of Q1 2019 and start the deprecation of
scheduler v1. We will make it the default scheduler of Kubernetes in Q2 2019,
but we will keep the option of using scheduler v1 for at least two more release
cycles.
</s>