Added NUMA Manager proposal draft.

This commit is contained in:
Connor Doyle 2018-01-24 21:13:50 -08:00
parent 5e8272ad36
commit ddf37e122b
1 changed files with 281 additions and 0 deletions

@ -0,0 +1,281 @@
# NUMA Manager
_Authors:_
* @ConnorDoyle - Connor Doyle <connor.p.doyle@intel.com>
* @balajismaniam - Balaji Subramaniam <balaji.subramaniam@intel.com>
* @lmdaly - Louise M. Daly <louise.m.daly@intel.com>
**Contents:**
* [Overview](#overview)
* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-Goals](#non-goals)
  * [User Stories](#user-stories)
* [Proposal](#proposal)
  * [Proposed Changes](#proposed-changes)
    * [New Component: NUMA Manager](#new-component-numa-manager)
      * [Computing Preferred Affinity](#computing-preferred-affinity)
      * [New Interfaces](#new-interfaces)
    * [Changes to Existing Components](#changes-to-existing-components)
* [Graduation Criteria](#graduation-criteria)
  * [alpha (target v1.11)](#alpha-target-v1.11)
  * [beta](#beta)
  * [GA (stable)](#ga-stable)
* [Challenges](#challenges)
* [Limitations](#limitations)
* [Alternatives](#alternatives)
* [References](#references)
# Overview
An increasing number of systems leverage a combination of CPUs and
hardware accelerators to support latency-critical execution and
high-throughput parallel computation. These include workloads in fields
such as telecommunications, scientific computing, machine learning,
financial services and data analytics. Such hybrid systems comprise a
high performance environment.
In order to extract the best performance, optimizations related to CPU
isolation and memory and device locality are required. However, in
Kubernetes, these optimizations are handled by disjoint components.
This proposal provides a mechanism to coordinate fine-grained hardware
resource assignments for different components in Kubernetes.
# Motivation
Multiple components in the Kubelet make decisions about system
topology-related assignments:
- CPU manager.
  - The CPU manager makes decisions about the set of CPUs a container is
    allowed to run on. The only implemented policy as of v1.8 is the static
    one, which does not change assignments for the lifetime of a container.
- Device manager.
  - The device manager makes concrete device assignments to satisfy
    container resource requirements. Generally devices are attached to one
    peripheral interconnect. If the device manager and the CPU manager are
    misaligned, all communication between the CPU and the device can incur
    an additional hop over the processor interconnect fabric.
- Container network interface.
  - NICs including SR-IOV Virtual Functions have affinity to one NUMA node,
    with measurable performance ramifications.
*Related Issues:*
- [Hardware topology awareness at node level (including NUMA)][k8s-issue-49964]
- [Discover nodes with NUMA architecture][nfd-issue-84]
- [Support VF interrupt binding to specified CPU][sriov-issue-10]
- [Proposal: CPU Affinity and NUMA Topology Awareness][proposal-affinity]
Note that all of these concerns pertain only to multi-socket systems.
## Goals
- Allow CPU manager and device plugin subsystem to agree on preferred
NUMA node affinity for containers.
- Provide an internal interface and pattern to integrate additional
topology-aware Kubelet components.
## Non-Goals
- _Inter-device connectivity:_ Decide device assignments based on direct
device interconnects. This issue can be separated from NUMA node
locality. Inter-device topology can be considered entirely within the
scope of the device plugin subsystem, after which it can emit possible
NUMA affinities. The policy to reach that decision can start simple
and iterate toward support for arbitrary inter-device graphs.
- _HugePages:_ This proposal assumes that pre-allocated HugePages are
spread among the available NUMA nodes in the system. We further assume
the operating system provides best-effort local page allocation for
containers (as long as sufficient HugePages are free on the local NUMA
node).
- _CNI:_ Changing the Container Networking Interface is out of scope for
this proposal. However, this design should be extensible enough to
accommodate network interface locality if the CNI adds support in the
future. This limitation is potentially mitigated by the possibility of
using the device plugin API as a stopgap solution for specialized
networking requirements.
## User Stories
*Story 1: Fast virtualized network functions*
A user asks for a "fast network" and automatically gets all the various
pieces coordinated (hugepages, cpusets, network device) co-located on a
NUMA node.
*Story 2: Accelerated neural network training*
A user asks for an accelerator device and some number of exclusive CPUs
in order to get the best training performance, due to NUMA-alignment of
the assigned CPUs and devices.
# Proposal
*Main idea: Two Phase NUMA coherence protocol*
NUMA affinity is tracked at the container level, similar to devices and
CPU affinity. At pod admission time, a new component called the NUMA Manager
collects possible NUMA configurations from the Device Manager and the
CPU Manager. The NUMA Manager then acts as an oracle for NUMA node affinity,
consulted by those same components when they make concrete resource allocations. We
expect the consulted components to use the inferred QoS class of each
pod in order to prioritize the importance of fulfilling optimal NUMA
affinity.
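
The two phases can be illustrated with a small, self-contained Go sketch. Everything in it is an assumption made for illustration: `NUMAMask` is modeled as a `uint64` bit set (the proposal leaves the type TBD), pods and containers are identified by plain strings, `staticProvider` is a hypothetical stand-in for the CPU Manager or Device Manager, and the merge rule (OR the hints of each provider, then AND across providers) is a simplification rather than the selection procedure proposed later in this document.

```go
package main

// NUMAMask is modeled as a bit set (bit i set means NUMA node i).
// Assumption: the proposal leaves the concrete type TBD.
type NUMAMask uint64

// hintProvider is a simplified stand-in for the proposed HintProvider
// interface, keyed by names instead of a pod object.
type hintProvider interface {
	GetNUMAHints(podName, containerName string) []NUMAMask
}

// staticProvider is a hypothetical provider that returns the same hints
// for every container.
type staticProvider []NUMAMask

func (s staticProvider) GetNUMAHints(_, _ string) []NUMAMask { return s }

type containerKey struct{ pod, container string }

// manager records one affinity mask per (pod, container).
type manager struct {
	providers []hintProvider
	affinity  map[containerKey]NUMAMask
}

func newManager() *manager {
	return &manager{affinity: make(map[containerKey]NUMAMask)}
}

func (m *manager) AddHintProvider(p hintProvider) {
	m.providers = append(m.providers, p)
}

// Admit is phase one: collect hints from every provider and record the
// NUMA nodes acceptable to all of them. A zero mask means the providers
// could not agree; this sketch stores it anyway (best effort).
func (m *manager) Admit(podName string, containers []string) {
	for _, c := range containers {
		common := ^NUMAMask(0)
		for _, p := range m.providers {
			var acceptable NUMAMask
			for _, h := range p.GetNUMAHints(podName, c) {
				acceptable |= h
			}
			common &= acceptable
		}
		m.affinity[containerKey{podName, c}] = common
	}
}

// GetAffinity is phase two: providers consult the recorded mask when
// they make concrete CPU or device allocations.
func (m *manager) GetAffinity(podName, containerName string) NUMAMask {
	return m.affinity[containerKey{podName, containerName}]
}
```

With one provider offering nodes 0 or 1 (`staticProvider{0b01, 0b10}`) and another offering only node 1 (`staticProvider{0b10}`), `Admit` records `0b10` for the container, and both providers would later read that value back through `GetAffinity`.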
## Proposed Changes
### New Component: NUMA Manager
This proposal is focused on a new component in the Kubelet called the
NUMA Manager. The NUMA Manager implements the pod admit handler
interface and participates in Kubelet pod admission. When the `Admit()`
function is called, the NUMA manager collects NUMA hints from other
Kubelet components.
If the NUMA hints are not compatible, the NUMA
manager could choose to reject the pod. The details of what to do in
this situation need more discussion. For example, the NUMA manager
could enforce strict NUMA alignment for Guaranteed QoS pods.
Alternatively, the NUMA manager could simply provide best-effort NUMA
alignment for all pods.
The NUMA Manager component will be disabled behind a feature gate until
graduation from alpha to beta.
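
The two candidate behaviors for incompatible hints (strict alignment for Guaranteed pods versus best effort for all pods) can be compared with a short Go sketch. The `Policy` type, its constants, and the `admit` function are hypothetical names introduced only for this illustration; the proposal does not define them, and the choice between the two behaviors is still open.

```go
package main

// QOSClass mirrors the three Kubernetes pod QoS classes by name.
// It is redeclared here so the sketch stays self-contained.
type QOSClass string

const (
	Guaranteed QOSClass = "Guaranteed"
	Burstable  QOSClass = "Burstable"
	BestEffort QOSClass = "BestEffort"
)

// Policy is a hypothetical knob selecting one of the two behaviors
// discussed above.
type Policy int

const (
	// PolicyBestEffort admits every pod, aligned or not.
	PolicyBestEffort Policy = iota
	// PolicyStrictForGuaranteed rejects Guaranteed pods whose hints
	// cannot be reconciled.
	PolicyStrictForGuaranteed
)

// admit reports whether a pod would be admitted, given whether its hint
// providers agree on at least one NUMA node.
func admit(p Policy, qos QOSClass, hintsCompatible bool) bool {
	if hintsCompatible {
		return true
	}
	return !(p == PolicyStrictForGuaranteed && qos == Guaranteed)
}
```

Under this sketch, only a Guaranteed pod with incompatible hints is rejected, and only when the strict policy is selected; every other combination is admitted.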
#### Computing Preferred Affinity
A NUMA hint is a list of possible NUMA node masks. After collecting hints
from all providers, the NUMA Manager must choose some mask that is
present in all lists. Here is a sketch:
1. Apply a partial order on each list: number of bits set in the
mask, ascending. This biases the result to be more precise if
possible.
1. Iterate over the combinations formed by taking one mask from each
preference list and compute bitwise-and over the masks in each combination.
1. Store the first non-empty result and break out early.
1. If no non-empty result exists, return an error.
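
The steps above can be sketched in Go as follows. This is one possible reading, not a definitive implementation: `NUMAMask` is assumed to be a `uint64` bit set (the proposal leaves it TBD), the iteration is taken to mean every combination of one mask per provider, and the order in which combinations are visited (last provider varying fastest) is an arbitrary choice that affects which "first" result is returned. The names `preferredAffinity` and `errNoCommonAffinity` are hypothetical.

```go
package main

import (
	"errors"
	"math/bits"
	"sort"
)

// NUMAMask is modeled as a bit set (bit i set means NUMA node i).
// Assumption: the proposal leaves the concrete type TBD.
type NUMAMask uint64

var errNoCommonAffinity = errors.New("no NUMA affinity satisfies all hint providers")

// preferredAffinity takes one hint list per provider and returns the
// first non-empty intersection found.
func preferredAffinity(hints [][]NUMAMask) (NUMAMask, error) {
	if len(hints) == 0 {
		return 0, errNoCommonAffinity
	}
	// Step 1: order a copy of each list by number of set bits, ascending,
	// so that narrower masks are tried first.
	sorted := make([][]NUMAMask, len(hints))
	for i, h := range hints {
		if len(h) == 0 {
			return 0, errNoCommonAffinity
		}
		c := append([]NUMAMask(nil), h...)
		sort.SliceStable(c, func(a, b int) bool {
			return bits.OnesCount64(uint64(c[a])) < bits.OnesCount64(uint64(c[b]))
		})
		sorted[i] = c
	}
	// Steps 2 and 3: walk every combination of one mask per provider,
	// using idx as an odometer, and stop at the first non-empty AND.
	idx := make([]int, len(sorted))
	for {
		acc := ^NUMAMask(0)
		for i, j := range idx {
			acc &= sorted[i][j]
		}
		if acc != 0 {
			return acc, nil
		}
		k := len(idx) - 1
		for k >= 0 {
			idx[k]++
			if idx[k] < len(sorted[k]) {
				break
			}
			idx[k] = 0
			k--
		}
		if k < 0 {
			// Step 4: every combination was empty.
			return 0, errNoCommonAffinity
		}
	}
}
```

For example, with hints `{0b11, 0b01}` from one provider and `{0b10, 0b11}` from another, this sketch returns `0b01`; with `{0b01}` and `{0b10}` it returns the error. Note that the number of combinations grows multiplicatively with the number of providers, which is acceptable only while both the provider count and the hint lists stay small.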
#### New Interfaces
```go
package numamanager

import (
	"k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/kubelet/lifecycle"
)

// Manager helps to coordinate NUMA-related resource assignments
// within the Kubelet.
type Manager interface {
	lifecycle.PodAdmitHandler
	Store
	AddHintProvider(HintProvider)
	RemovePod(podName string)
}

// NUMAMask is a bitmask-like type denoting a subset of available NUMA nodes.
type NUMAMask struct{} // TBD

// Store manages state related to the NUMA manager.
type Store interface {
	// GetAffinity returns the preferred NUMA affinity for the supplied
	// pod and container.
	GetAffinity(podName string, containerName string) NUMAMask
}

// HintProvider is implemented by Kubelet components that make
// NUMA-related resource assignments. The NUMA manager consults each
// hint provider at pod admission time.
type HintProvider interface {
	GetNUMAHints(pod v1.Pod, containerName string) []NUMAMask
}
```
_NUMA Manager and related interfaces (sketch)._
![numa-manager-components](https://user-images.githubusercontent.com/379372/35370509-13dd9488-0143-11e8-998b-6b5115982842.png)
_NUMA Manager components._
![numa-manager-instantiation](https://user-images.githubusercontent.com/379372/35370513-17f90f70-0143-11e8-88e3-f199e9717946.png)
_NUMA Manager instantiation and inclusion in pod admit lifecycle._
### Changes to Existing Components
1. Kubelet consults NUMA Manager for pod admission (discussed above.)
1. Add two implementations of NUMA Manager interface and a feature gate.
    1. As much NUMA Manager functionality as possible is stubbed when the
       feature gate is disabled.
    1. Add a functional NUMA manager that queries hint providers in order
       to compute a preferred NUMA node mask for each container.
1. Add `GetNUMAHints()` method to CPU Manager.
    1. CPU Manager static policy calls `GetAffinity()` method of NUMA
       manager when deciding CPU affinity.
1. Add `GetNUMAHints()` method to Device Manager.
    1. Add NUMA Node ID to Device structure in the device plugin
       interface. Plugins should be able to determine the NUMA node
       easily when enumerating supported devices. For example, Linux
       exposes the node ID in sysfs for PCI devices:
       `/sys/devices/pci*/*/numa_node`.
    1. Device Manager calls `GetAffinity()` method of NUMA manager when
       deciding device allocation.
![numa-manager-wiring](https://user-images.githubusercontent.com/379372/35370514-1e10fb84-0143-11e8-84d3-99c9ca3af111.png)
_NUMA Manager hint provider registration._
![numa-manager-hints](https://user-images.githubusercontent.com/379372/35370517-234a5d34-0143-11e8-845a-80e5c66c7b72.png)
_NUMA Manager fetches affinity from hint providers._
# Graduation Criteria
## Alpha (target v1.11)
* Feature gate is disabled by default.
* Alpha-level documentation.
* Unit test coverage.
* CPU Manager allocation policy takes NUMA hints into account.
* Device plugin interface includes NUMA node ID.
* Device Manager allocation policy takes NUMA hints into account.
## Beta
* Feature gate is enabled by default.
* Beta-level documentation.
* Node e2e tests.
* User feedback.
## GA (stable)
* *TBD*
# Challenges
* Testing the NUMA Manager in a continuous integration environment
depends on cloud infrastructure to expose multi-node NUMA topologies
to guest virtual machines.
* Implementing the `GetNUMAHints()` interface may prove challenging.
# Limitations
* *TBD*
# Alternatives
* [AutoNUMA][numa-challenges]: This kernel feature affects memory
allocation and thread scheduling, but does not address device locality.
# References
* *TBD*
[k8s-issue-49964]: https://github.com/kubernetes/kubernetes/issues/49964
[nfd-issue-84]: https://github.com/kubernetes-incubator/node-feature-discovery/issues/84
[sriov-issue-10]: https://github.com/hustcat/sriov-cni/issues/10
[proposal-affinity]: https://github.com/kubernetes/community/pull/171
[numa-challenges]: https://queue.acm.org/detail.cfm?id=2852078