# NUMA Manager

_Authors:_

* @ConnorDoyle - Connor Doyle &lt;connor.p.doyle@intel.com&gt;
* @balajismaniam - Balaji Subramaniam &lt;balaji.subramaniam@intel.com&gt;
* @lmdaly - Louise M. Daly &lt;louise.m.daly@intel.com&gt;

**Contents:**

* [Overview](#overview)
* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-Goals](#non-goals)
  * [User Stories](#user-stories)
* [Proposal](#proposal)
  * [Proposed Changes](#proposed-changes)
    * [New Component: NUMA Manager](#new-component-numa-manager)
      * [Computing Preferred Affinity](#computing-preferred-affinity)
      * [New Interfaces](#new-interfaces)
    * [Changes to Existing Components](#changes-to-existing-components)
* [Graduation Criteria](#graduation-criteria)
  * [Alpha (target v1.11)](#alpha-target-v111)
  * [Beta](#beta)
  * [GA (stable)](#ga-stable)
* [Challenges](#challenges)
* [Limitations](#limitations)
* [Alternatives](#alternatives)
* [References](#references)

# Overview

An increasing number of systems leverage a combination of CPUs and
hardware accelerators to support latency-critical execution and
high-throughput parallel computation. These include workloads in fields
such as telecommunications, scientific computing, machine learning,
financial services, and data analytics. Such hybrid systems comprise a
high-performance environment.

In order to extract the best performance, optimizations related to CPU
isolation and memory and device locality are required. However, in
Kubernetes, these optimizations are handled by disjoint components.

This proposal provides a mechanism to coordinate fine-grained hardware
resource assignments for different components in Kubernetes.

# Motivation

Multiple components in the Kubelet make decisions about system
topology-related assignments:

- CPU manager.
  - The CPU manager makes decisions about the set of CPUs a container is
    allowed to run on. The only implemented policy as of v1.8 is the
    static one, which does not change assignments for the lifetime of a
    container.
- Device manager.
  - The device manager makes concrete device assignments to satisfy
    container resource requirements. Generally devices are attached to
    one peripheral interconnect. If the device manager and the CPU
    manager are misaligned, all communication between the CPU and the
    device can incur an additional hop over the processor interconnect
    fabric.
- Container network interface (CNI).
  - NICs, including SR-IOV virtual functions, have affinity to one NUMA
    node, with measurable performance ramifications.

*Related issues:*

- [Hardware topology awareness at node level (including NUMA)][k8s-issue-49964]
- [Discover nodes with NUMA architecture][nfd-issue-84]
- [Support VF interrupt binding to specified CPU][sriov-issue-10]
- [Proposal: CPU Affinity and NUMA Topology Awareness][proposal-affinity]

Note that all of these concerns pertain only to multi-socket systems.

## Goals

- Allow the CPU manager and the device plugin subsystem to agree on
  preferred NUMA node affinity for containers.
- Provide an internal interface and pattern to integrate additional
  topology-aware Kubelet components.

## Non-Goals

- _Inter-device connectivity:_ Deciding device assignments based on
  direct device interconnects. This issue can be separated from NUMA
  node locality. Inter-device topology can be considered entirely within
  the scope of the device plugin subsystem, after which it can emit
  possible NUMA affinities. The policy to reach that decision can start
  simple and iterate toward support for arbitrary inter-device graphs.
- _HugePages:_ This proposal assumes that pre-allocated HugePages are
  spread among the available NUMA nodes in the system. We further assume
  that the operating system provides best-effort local page allocation
  for containers (as long as sufficient HugePages are free on the local
  NUMA node).
- _CNI:_ Changing the Container Networking Interface is out of scope for
  this proposal. However, this design should be extensible enough to
  accommodate network interface locality if the CNI adds support in the
  future. This limitation is potentially mitigated by the possibility of
  using the device plugin API as a stopgap solution for specialized
  networking requirements.

## User Stories

*Story 1: Fast virtualized network functions*

A user asks for a "fast network" and automatically gets all the various
pieces coordinated (hugepages, cpusets, network device) co-located on a
NUMA node.

*Story 2: Accelerated neural network training*

A user asks for an accelerator device and some number of exclusive CPUs
in order to get the best training performance, due to NUMA alignment of
the assigned CPUs and devices.

# Proposal

*Main idea: Two-phase NUMA coherence protocol*

NUMA affinity is tracked at the container level, similar to devices and
CPU affinity. At pod admission time, a new component called the NUMA
Manager collects possible NUMA configurations from the Device Manager
and the CPU Manager. The NUMA Manager then acts as an oracle for NUMA
node affinity, which those same components consult when they make
concrete resource allocations. We expect the consulted components to use
the inferred QoS class of each pod in order to prioritize the importance
of fulfilling optimal NUMA affinity.

## Proposed Changes

### New Component: NUMA Manager

This proposal is focused on a new component in the Kubelet called the
NUMA Manager. The NUMA Manager implements the pod admit handler
interface and participates in Kubelet pod admission. When the `Admit()`
function is called, the NUMA Manager collects NUMA hints from other
Kubelet components.

If the NUMA hints are not compatible, the NUMA Manager could choose to
reject the pod. The details of what to do in this situation need more
discussion. For example, the NUMA Manager could enforce strict NUMA
alignment for Guaranteed QoS pods. Alternatively, the NUMA Manager could
simply provide best-effort NUMA alignment for all pods.

The NUMA Manager component will be disabled behind a feature gate until
graduation from alpha to beta.

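The admission flow described above can be sketched in Go. This is an
illustrative sketch, not the proposed implementation: the Kubelet types
(`v1.Pod`, `lifecycle.PodAdmitHandler`) are replaced by local stubs, the
names `numaManager` and `fixedHints` are invented for the example, and
each provider's hint list is collapsed with a simple union-then-intersect
shortcut rather than the full merge procedure described under Computing
Preferred Affinity.

```go
package main

import "fmt"

// Pod stubs the fields of v1.Pod relevant to this sketch.
type Pod struct {
	Name       string
	Containers []string
}

// HintProvider mirrors the proposed interface, with NUMA masks modeled
// as plain uint64 bitmasks for illustration.
type HintProvider interface {
	GetNUMAHints(pod Pod, containerName string) []uint64
}

// fixedHints is a trivial provider that always returns the same hints.
type fixedHints []uint64

func (f fixedHints) GetNUMAHints(pod Pod, containerName string) []uint64 {
	return []uint64(f)
}

type numaManager struct {
	providers []HintProvider
	affinity  map[string]uint64 // key: "<pod>/<container>"
}

// Admit collects hints from every registered provider and rejects the
// pod when no common NUMA mask exists. This models the strict policy;
// the proposal also discusses a best-effort alternative.
func (m *numaManager) Admit(pod Pod) error {
	for _, c := range pod.Containers {
		merged := ^uint64(0) // start with "any node"
		for _, p := range m.providers {
			// Simplification: treat any mask offered by a provider
			// as acceptable, so take the union before intersecting.
			var union uint64
			for _, mask := range p.GetNUMAHints(pod, c) {
				union |= mask
			}
			merged &= union
		}
		if merged == 0 {
			return fmt.Errorf("no NUMA affinity possible for container %q in pod %q", c, pod.Name)
		}
		m.affinity[pod.Name+"/"+c] = merged
	}
	return nil
}
```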
#### Computing Preferred Affinity

A NUMA hint is a list of possible NUMA node masks. After collecting
hints from all providers, the NUMA Manager must choose some mask that is
present in all lists. Here is a sketch:

1. Apply a partial order on each list: number of bits set in the
   mask, ascending. This biases the result to be more precise if
   possible.
1. Iterate over the permutations of preference lists and compute the
   bitwise AND over the masks in each permutation.
1. Store the first non-empty result and break out early.
1. If no non-empty result exists, return an error.

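The steps above can be sketched in Go. Since the concrete `NUMAMask`
type is left TBD in this proposal, the sketch assumes a plain `uint64`
bitmask; `mergeHints` and its error message are illustrative names, not
part of the design.

```go
package main

import (
	"fmt"
	"math/bits"
	"sort"
)

// NUMAMask is modeled as a uint64 for illustration: bit i set means
// NUMA node i is acceptable.
type NUMAMask = uint64

// mergeHints returns the first non-empty bitwise AND obtained by taking
// one mask from each provider's hint list, trying narrower masks first.
func mergeHints(hints [][]NUMAMask) (NUMAMask, error) {
	// Step 1: sort each list by number of set bits, ascending, so more
	// precise masks are tried first.
	for _, list := range hints {
		sort.Slice(list, func(i, j int) bool {
			return bits.OnesCount64(list[i]) < bits.OnesCount64(list[j])
		})
	}
	var result NUMAMask
	found := false
	// Steps 2-3: walk the cross product of the hint lists, AND-ing one
	// mask per provider, and keep the first non-empty intersection.
	var walk func(depth int, acc NUMAMask)
	walk = func(depth int, acc NUMAMask) {
		if found {
			return
		}
		if depth == len(hints) {
			if acc != 0 {
				result, found = acc, true
			}
			return
		}
		for _, m := range hints[depth] {
			walk(depth+1, acc&m)
		}
	}
	walk(0, ^NUMAMask(0))
	// Step 4: no combination intersects, so report an error.
	if !found {
		return 0, fmt.Errorf("no common NUMA node mask across providers")
	}
	return result, nil
}
```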
#### New Interfaces

```go
package numamanager

// Manager helps to coordinate NUMA-related resource assignments
// within the Kubelet.
type Manager interface {
	lifecycle.PodAdmitHandler
	Store
	AddHintProvider(HintProvider)
	RemovePod(podName string)
}

// NUMAMask is a bitmask-like type denoting a subset of available NUMA nodes.
type NUMAMask struct{} // TBD

// Store manages state related to the NUMA Manager.
type Store interface {
	// GetAffinity returns the preferred NUMA affinity for the supplied
	// pod and container.
	GetAffinity(podName string, containerName string) NUMAMask
}

// HintProvider is implemented by Kubelet components that make
// NUMA-related resource assignments. The NUMA Manager consults each
// hint provider at pod admission time.
type HintProvider interface {
	GetNUMAHints(pod v1.Pod, containerName string) []NUMAMask
}
```

_NUMA Manager and related interfaces (sketch)._

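The `NUMAMask` type is left TBD in the sketch above. One plausible
concrete representation, assumed here purely for illustration, is a
fixed-width `uint64` bitmask with bit i set when NUMA node i is
included; the helper method names are also assumptions.

```go
package main

import "math/bits"

// NUMAMask is one possible concrete representation of the TBD type:
// a bitmask over NUMA node IDs 0..63.
type NUMAMask uint64

// Set returns a copy of the mask with the given node added.
func (m NUMAMask) Set(node int) NUMAMask { return m | (1 << uint(node)) }

// IsSet reports whether the given node is in the mask.
func (m NUMAMask) IsSet(node int) bool { return m&(1<<uint(node)) != 0 }

// Count returns the number of nodes in the mask; the affinity
// computation prefers masks with fewer nodes.
func (m NUMAMask) Count() int { return bits.OnesCount64(uint64(m)) }

// And intersects two masks, the core operation when merging hints.
func (m NUMAMask) And(other NUMAMask) NUMAMask { return m & other }
```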

|
||||
|
||||
_NUMA Manager components._
|
||||
|
||||

|
||||
|
||||
_NUMA Manager instantiation and inclusion in pod admit lifecycle._
|
||||
|
||||
### Changes to Existing Components

1. The Kubelet consults the NUMA Manager for pod admission (discussed above).
1. Add two implementations of the NUMA Manager interface and a feature gate.
    1. As much NUMA Manager functionality as possible is stubbed when the
       feature gate is disabled.
    1. Add a functional NUMA Manager that queries hint providers in order
       to compute a preferred NUMA node mask for each container.
1. Add a `GetNUMAHints()` method to the CPU Manager.
    1. The CPU Manager static policy calls the `GetAffinity()` method of
       the NUMA Manager when deciding CPU affinity.
1. Add a `GetNUMAHints()` method to the Device Manager.
    1. Add a NUMA node ID to the Device structure in the device plugin
       interface. Plugins should be able to determine the NUMA node
       easily when enumerating supported devices. For example, Linux
       exposes the node ID in sysfs for PCI devices:
       `/sys/devices/pci*/*/numa_node`.
    1. The Device Manager calls the `GetAffinity()` method of the NUMA
       Manager when deciding device allocation.


|
||||
|
||||
_NUMA Manager hint provider registration._
|
||||
|
||||

|
||||
|
||||
_NUMA Manager fetches affinity from hint providers._
|
||||
|
||||
# Graduation Criteria

## Alpha (target v1.11)

* Feature gate is disabled by default.
* Alpha-level documentation.
* Unit test coverage.
* CPU Manager allocation policy takes NUMA hints into account.
* Device plugin interface includes NUMA node ID.
* Device Manager allocation policy takes NUMA hints into account.

## Beta

* Feature gate is enabled by default.
* Beta-level documentation.
* Node e2e tests.
* User feedback.

## GA (stable)

* *TBD*

# Challenges

* Testing the NUMA Manager in a continuous integration environment
  depends on cloud infrastructure to expose multi-node NUMA topologies
  to guest virtual machines.
* Implementing the `GetNUMAHints()` interface may prove challenging.

# Limitations

* *TBD*

# Alternatives

* [AutoNUMA][numa-challenges]: This kernel feature affects memory
  allocation and thread scheduling, but does not address device locality.

# References

* *TBD*

[k8s-issue-49964]: https://github.com/kubernetes/kubernetes/issues/49964
[nfd-issue-84]: https://github.com/kubernetes-incubator/node-feature-discovery/issues/84
[sriov-issue-10]: https://github.com/hustcat/sriov-cni/issues/10
[proposal-affinity]: https://github.com/kubernetes/community/pull/171
[numa-challenges]: https://queue.acm.org/detail.cfm?id=2852078