Device plugin proposal patch by Jiaying

Jiaying Zhang 2017-07-19 17:53:03 -07:00 committed by Renaud Gaubert
parent 9d7a245a23
commit 4cbc77e491
2 changed files with 304 additions and 129 deletions

Binary file not shown (image, 67 KiB).

@ -1,99 +1,77 @@
# Device Manager Proposal
Device Manager Proposal
===============
1. [Abstract](#abstract)
2. [Motivation](#motivation)
3. [Use Cases](#use-cases)
4. [Objectives](#objectives)
5. [Non Objectives](#non-objectives)
6. [Stories](#stories)
* [Vendor story](#vendor-story)
* [User story](#user-story)
8. [Device Plugin](#device-plugin)
* [Protocol Overview](#protocol-overview)
* [Protobuf specification](#protobuf-specification)
* [Installation](#installation)
* [API Changes](#api-changes)
* [Versioning](#versioning)
<!-- BEGIN MUNGE: GENERATED_TOC -->
- [Motivation](#motivation)
- [Use Cases](#use-cases)
- [Objectives](#objectives)
- [Non Objectives](#non-objectives)
- [Proposed Implementation 1](#proposed-implementation-1)
- [Vendor story](#vendor-story)
- [End User story](#end-user-story)
- [Device Plugin](#device-plugin)
- [Introduction](#introduction)
- [Registration](#registration)
- [Unix Socket](#unix-socket)
- [Protocol Overview](#protocol-overview)
- [Protobuf specification](#protobuf-specification)
- [Proposed Implementation 2](#proposed-implementation-2)
- [Device Plugin Lifecycle](#device-plugin-lifecycle)
- [Protobuf API](#protobuf-api)
- [Failure recovery](#failure-recovery)
- [Roadmap](#roadmap)
- [Open Questions](#open-questions-1)
- [Installation](#installation)
- [Versioning](#versioning)
- [References](#references)
<!-- END MUNGE: GENERATED_TOC -->
_Authors:_
* @RenaudWasTaken - Renaud Gaubert &lt;rgaubert@NVIDIA.com&gt;
## Abstract
This document describes a vendor-independent solution to:
* Discovering and representing external devices
* Making these devices available to the container and cleaning them up
afterwards
* Health Check of these devices
Because devices are vendor-dependent and have their own sets of problems
and mechanisms, the solution we describe is a plugin mechanism managed by
Kubelet.
At their core, device plugins are simple gRPC servers that may run in a
container deployed through the pod mechanism.
These servers implement the gRPC interface defined later in this design
document and once the device plugin makes itself known to kubelet, kubelet
will interact with the device through three simple functions:
1. A `Discover` function for the kubelet to Discover the devices and
their properties.
2. An `Allocate` and `Deallocate` function which are called respectively
before container creation and after container deletion with the
devices to allocate and deallocate.
3. A `Monitor` function to notify Kubelet whenever a device becomes
unhealthy.
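As a rough illustration of the three functions listed above, the following Go sketch groups them into one interface and exercises them against an in-memory stand-in. The `Device` type, every method signature, and `fakePlugin` are assumptions made for this example; the actual gRPC interface is the one defined later in the document.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical, minimal description of one vendor device.
type Device struct {
	ID      string
	Healthy bool
}

// DevicePlugin groups the functions named in the abstract.
// All signatures are assumptions made for illustration.
type DevicePlugin interface {
	Discover() []Device
	Allocate(ids []string) error
	Deallocate(ids []string) error
	Monitor(unhealthy chan<- Device)
}

// fakePlugin is an in-memory stand-in for a vendor implementation.
type fakePlugin struct {
	devices   []Device
	allocated map[string]bool
}

func (p *fakePlugin) Discover() []Device { return p.devices }

// Allocate marks the devices as in use, refusing any that already are.
func (p *fakePlugin) Allocate(ids []string) error {
	for _, id := range ids {
		if p.allocated[id] {
			return errors.New("device already allocated: " + id)
		}
	}
	for _, id := range ids {
		p.allocated[id] = true
	}
	return nil
}

// Deallocate releases the devices after container deletion.
func (p *fakePlugin) Deallocate(ids []string) error {
	for _, id := range ids {
		delete(p.allocated, id)
	}
	return nil
}

// Monitor reports every currently unhealthy device once, then closes the channel.
func (p *fakePlugin) Monitor(unhealthy chan<- Device) {
	for _, d := range p.devices {
		if !d.Healthy {
			unhealthy <- d
		}
	}
	close(unhealthy)
}

func main() {
	var plugin DevicePlugin = &fakePlugin{
		devices:   []Device{{ID: "gpu0", Healthy: true}, {ID: "gpu1", Healthy: false}},
		allocated: map[string]bool{},
	}
	fmt.Println("discovered:", plugin.Discover())
	fmt.Println("allocate:", plugin.Allocate([]string{"gpu0"}))
	fmt.Println("allocate again:", plugin.Allocate([]string{"gpu0"}))
	fmt.Println("deallocate:", plugin.Deallocate([]string{"gpu0"}))

	unhealthy := make(chan Device, 2)
	plugin.Monitor(unhealthy)
	for d := range unhealthy {
		fmt.Println("unhealthy:", d.ID)
	}
}
```

A real plugin would keep `Monitor` running for its whole lifetime; the one-shot version here only keeps the example short.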
The goal is for a user to be able to enable vendor devices (e.g., GPUs) through
the simple following steps:
* `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
* When launching `kubectl describe nodes`, the devices appear in the node spec
* In the long term users will be able to select them through Resource Class
We expect the plugins to be deployed across the clusters through DaemonSets.
The targeted devices are GPUs, NICs, FPGAs, InfiniBand, storage devices, etc.
## Motivation
# Motivation
Kubernetes currently supports discovery of CPU and Memory primarily to a
minimal extent. Very few devices are handled natively by Kubelet.
It is not a sustainable solution to expect every vendor to add their vendor
specific code inside Kubernetes. This approach does not scale and is not
portable.
specific code inside Kubernetes to make their devices usable.
Instead, we want a solution for vendors to be able to advertise their resources
to Kubelet and monitor them without writing custom Kubernetes code.
We also want to provide a consistent and portable solution for users to
consume hardware devices across k8s clusters.
We want a solution for those vendors to be able to advertise their resources
to kubelet and monitor them.
We also want a way for the user to specify which resource their jobs will use
and what constraints are associated to these resources.
This document describes a vendor-independent solution to:
* Discovering and representing external devices
* Making these devices available to the containers using these devices and
cleaning them up afterwards
* Monitoring these devices
Solving this problem requires a plugin system that lets vendors advertise and
monitor their resources on behalf of Kubelet.
Because devices are vendor-dependent and have their own sets of problems
and mechanisms, the solution we describe is a plugin mechanism that may run
in a container deployed through the DaemonSets mechanism.
The targeted devices include GPUs, High-performance NICs, FPGAs, InfiniBand,
Storage devices, and other similar computing resources that require vendor
specific initialization and setup.
Additionally, we introduce the concept of Device to be able to select
resources with constraints in a pod spec.
The goal is for a user to be able to enable vendor devices (e.g., GPUs) through
the following simple steps:
* `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
* When launching `kubectl describe nodes`, the devices appear in the node spec
* In the long term users will be able to select them through Resource Class
_GPU Integration Example:_
* [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
* [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
# Use Cases
_Kubernetes Meeting Notes On This:_
* [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
* [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
* I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
in my pod.
* I should be able to use that device without writing custom Kubernetes code.
* I want a consistent and portable solution to consume hardware devices
across k8s clusters.
## Use Cases
* I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
in my pod.
* I should be able to use that device without writing custom Kubernetes code.
* I want a consistent and portable solution to consume hardware devices
across k8s clusters.
## Objectives
# Objectives
1. Add support for vendor specific Devices in kubelet:
* Through a pluggable mechanism.
@ -103,16 +81,18 @@ _Kubernetes Meeting Notes On This:_
2. Define a deployment mechanism for this new API.
3. Define a versioning mechanism for this new API.
## Non Objectives
1. Advanced scheduling and resource selection (solved through [#782](https://github.com/Kubernetes/community/pull/782)).
# Non Objectives
1. Advanced scheduling and resource selection (solved through
[#782](https://github.com/Kubernetes/community/pull/782)).
We will only try to give basic selection primitives to the devices
2. Metrics: this should be the job of cadvisor and should probably either be
addressed there (cadvisor) or if people feel there is a case to be made
for it being addressed in the Device Plugin, in a follow up proposal.
## Stories
# Proposed Implementation 1
### Vendor story
## Vendor story
Kubernetes provides to vendors a mechanism called device plugins to:
* advertise devices.
@ -144,7 +124,7 @@ own gRPC server.
Only then will kubelet start interacting with the vendor's device plugin
through the gRPC apis.
### End User story
## End User story
When setting up the cluster the admin knows what kind of devices are present
on the different machines and therefore can select what devices they want to
@ -182,6 +162,7 @@ He might in the future be in charge of selecting the device.
## Device Plugin
### Introduction
The device plugin is structured in 5 parts:
1. Registration: The device plugin advertises its presence to Kubelet
2. Discovery: Kubelet calls the device plugin to list its devices
@ -189,7 +170,7 @@ The device plugin is structured in 5 parts:
devices advertised by the device plugin, Kubelet calls the device plugin's
`Allocate` and `Deallocate` functions.
4. Cleanup: Kubelet terminates the communication through a "Stop"
4. Heartbeat: The device plugin polls Kubelet to know if it's still alive
5. Heartbeat: The device plugin polls Kubelet to know if it's still alive
and if it has to re-issue a Register request
### Registration
@ -247,7 +228,7 @@ The device plugin is also expected to periodically call the `Heartbeat` function
exposed by Kubelet and issue a `Registration` request when it either can't reach
Kubelet or Kubelet answers with a `KO` response.
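The heartbeat rule described above (re-register when Kubelet is unreachable or answers `KO`) can be sketched as follows. The `Kubelet` interface, its method signatures, and `fakeKubelet` are hypothetical names introduced for this example only; they are not part of the specified API.

```go
package main

import (
	"errors"
	"fmt"
)

// Kubelet is a hypothetical client-side view of the two calls the device
// plugin makes to Kubelet; the signatures are assumptions for this sketch.
type Kubelet interface {
	Heartbeat() (string, error) // answers "OK" or "KO"
	Register() error
}

// heartbeatOnce performs one heartbeat and re-registers when Kubelet is
// unreachable or answers "KO". It reports whether a registration was attempted.
func heartbeatOnce(k Kubelet) (bool, error) {
	status, err := k.Heartbeat()
	if err == nil && status == "OK" {
		return false, nil
	}
	return true, k.Register()
}

// fakeKubelet simulates a Kubelet that may have restarted and forgotten the plugin.
type fakeKubelet struct {
	reachable  bool
	registered bool
}

func (f *fakeKubelet) Heartbeat() (string, error) {
	if !f.reachable {
		return "", errors.New("kubelet unreachable")
	}
	if !f.registered {
		return "KO", nil
	}
	return "OK", nil
}

func (f *fakeKubelet) Register() error {
	if !f.reachable {
		return errors.New("kubelet unreachable")
	}
	f.registered = true
	return nil
}

func main() {
	k := &fakeKubelet{reachable: true}

	// First heartbeat: Kubelet does not know the plugin, so it answers "KO".
	attempted, err := heartbeatOnce(k)
	fmt.Println("registration attempted:", attempted, "error:", err)

	// Second heartbeat: the plugin is registered, nothing to do.
	attempted, err = heartbeatOnce(k)
	fmt.Println("registration attempted:", attempted, "error:", err)
}
```

In a long-running plugin, `heartbeatOnce` would typically be driven by a `time.Ticker`; the polling interval is left open here because the proposal does not specify one.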
![Process](./device-plugin.png)
![Process](device-plugin.png)
### Protobuf specification
@ -343,7 +324,233 @@ message DeviceHealth {
}
```
## Installation
# Proposed Implementation 2
The main strategy of this proposed implementation is that we want to start with
something simple that can show benefits on our immediate use cases, yet the
API design should be extendable to support future requirements.
Here are the main motivations for this alternative proposed implementation:
* Discovery phase: can we eliminate this gRPC procedure? It seems more
natural for device plugin to send Kubelet the discovered device information
right after device initialization and the registration gRPC procedure.
* The current implementation uses gRPC to communicate between Kubelet and
device plugin. Both Kubelet and device plugin need to start a gRPC server
for two-way communication. This seems a bit complicated. Can we have
device plugin send enough information to Kubelet so that we only need
Kubelet to start gRPC server and device plugin is kept as gRPC client?
The main concern with one-way gRPC communication is that we can NOT
support device specific operations, like reset device, during
allocation/deallocation. Depending on how long we expect device specific
operations to take, we can support this feature later by
having device plugin also provide a gRPC service or Device Plugin
can instruct Kubelet to perform device specific operation hooks.
* Do we need checkpointing in the initial prototype implementation?
Even in alpha release, we may still want to be able to recover from
various failure scenarios. Otherwise, it would affect user experience.
Currently, it seems the only information we need to record somewhere
is what device is allocated to what pod/container. There have been
discussions on different ways and places to record this information.
The approach taken by the current implementation pushes this information
to ApiServer by extending NodeStatus interface between Kubelet and ApiServer.
The major concern on this approach is that it introduced an API extension
apart from the current model (Currently Node information recorded at ApiServer
only contains resource capacity information. Resource allocation information
is kept at Node). The second approach is for Kubelet to checkpoint this
information. This seems to align with the current Kubernetes model that
Kubelet is the component to implement allocation/deallocation functionalities.
The information we want to checkpoint, i.e., what device is allocated to what
pod/container, also seems generic enough to be implemented at Kubelet.
It may also allow other use cases outside device plugin, e.g., CPU pinning.
The third approach is to implement this in device plugin. This way,
device plugin can also record any state information useful to its own
failure recovery in checkpoints. One concern on this approach is that it
may add more burdens on vendors to implement their device plugin images.
Surprises might happen in production if things were not implemented correctly.
It also seems apart from the current model as today Kubelet is the place
where allocation/deallocation happens for other types of resources.
* Heartbeat: do we need this to make sure connections can be re-established
between kubelet and device plugin? Can we reuse keepalive feature from gRPC?
Or if Kubelet checkpoints device allocation state information, device plugin
may only need to detect Kubelet failure when it needs to update device
information. Or can device plugin send periodic device state updates
(this may be needed anyway if we want to collect device usage stats)
and use that to detect Kubelet failure or device plugin failure?
## Device Plugin Lifecycle
![Process](device-plugin-2.png)
1. User or cluster admin pushes vendor-specific device plugin DaemonSets.
The DaemonSets YAML config includes mountPaths to the host directories
where device driver, user-space libraries, and tools will be installed.
2. After device plugin container is brought up, it detects the specific
types of HW devices. If such devices exist, it initializes these devices
and sets up the environments to access these devices (e.g., install
device drivers, user-space libraries, and tools).
3. After initialization, device plugin queries HW device states through the
installed device monitoring tools or other device interfaces. Then device
plugin connects to the Kubelet device plugin gRPC server and sends it the
obtained list of HW device information. In the initial prototype, the
device resource exported by a device plugin can be implemented as an
[extended OIR](https://github.com/kubernetes/kubernetes/pull/48922)
with special prefix “extensions.kubernetes.io/”, plus device resource_name
that uniquely identifies a device plugin on a node.
Kubelet can use existing API to add this resource to API server so that the
device resource is available for scheduling.
4. Device plugin runs in a loop to continuously query HW device states. If it
detects any changes, it sends the Kubelet device plugin gRPC server the new
list of HW device information. Kubelet can use this information to update its
device capacity states and if necessary, re-allocate new device to a user
container with unhealthy allocated devices.
5. When Kubelet receives an allocation request for a HW device advertised
by a device plugin (i.e., resource with “extensions.kubernetes.io/” prefix
plus device resource_name), it updates its internal allocation state,
issues certain calls to CRI to bind mount the host directories where
user-space libraries and tools are installed to the device-specific
default directories in user Pod or set up certain environment variables,
and checkpoints the container-to-device allocation information to persistent
storage.
6. When user container accessing the device finishes, Kubelet updates its
internal state to deallocate the device, and updates its checkpoint state
in persistent storage.
7. When device plugin DaemonSets is removed, clean up device state (e.g., uninstall
device driver, remove user-space libraries and tools). This step can be
specified as a preStop
[container lifecycle step](https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/).
Note one implication of this approach is that the device plugin upgrade process
will be disruptive. More thought is needed if we want to
support a non-disruptive upgrade process.
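Steps 3 and 4 of the lifecycle above can be sketched as a small polling helper: build the resource name from the prefix named in the proposal, query device states, and report only when something changed. The function names and the `map[string]string` state representation (device ID to state) are assumptions for this sketch; a real plugin would send the list over gRPC instead of calling a local callback.

```go
package main

import (
	"fmt"
	"reflect"
)

// resourcePrefix is the prefix named in the proposal for exported resources.
const resourcePrefix = "extensions.kubernetes.io/"

// resourceName builds the node-level resource name for a device plugin.
func resourceName(pluginResource string) string {
	return resourcePrefix + pluginResource
}

// pollOnce queries the current device states and calls report only when they
// differ from the previous poll. It returns the states to compare against next time.
func pollOnce(
	prev map[string]string,
	query func() map[string]string,
	report func(map[string]string),
) map[string]string {
	cur := query()
	if !reflect.DeepEqual(prev, cur) {
		report(cur)
	}
	return cur
}

func main() {
	fmt.Println("resource:", resourceName("nvidia-gpu"))

	// Simulated results of three consecutive hardware queries.
	polls := []map[string]string{
		{"gpu0": "HEALTHY"},
		{"gpu0": "HEALTHY"},
		{"gpu0": "UNHEALTHY"},
	}

	var prev map[string]string
	for i, states := range polls {
		states := states
		prev = pollOnce(prev,
			func() map[string]string { return states },
			func(s map[string]string) { fmt.Printf("poll %d: reporting %v\n", i, s) },
		)
	}
}
```

Because the previous state starts as `nil`, the first poll always triggers a report, which matches the initial device list sent after initialization.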
## Protobuf API
```protobuf
syntax = "proto3";

service PluginResource {
  rpc Register(RegisterRequest) returns (RegisterResponse) {}
  rpc ReportDeviceInfo(ReportRequest) returns (ReportResponse) {}
}

message RegisterRequest {
  // Version of the API the Device Plugin was built against
  string version = 1;
  // E.g., "nvidia-gpu". Used to construct OIR:
  // “extensions.kubernetes.io/resourcename”.
  string resourcename = 2;
}

message RegisterResponse {
  bool success = 1;
  // Kubelet fills this field with details if it encounters any errors
  // during the registration process, e.g., for version mismatch, what
  // is the required version and minimum supported version by kubelet.
  string error = 2;
}

message ReportRequest {
  repeated DeviceInfo devices = 1;
}

message DeviceInfo {
  // Health states a device plugin can report for a device.
  enum State {
    UNKNOWN = 0;
    HEALTHY = 1;
    UNHEALTHY = 2;
  }
  // E.g., "GPU-fef8089b-4820-abfc-e83e-94318197576e".
  // Needs to be unique per device plugin.
  string Id = 1;
  // E.g., UNKNOWN, HEALTHY, UNHEALTHY.
  State state = 2;
  // E.g., {"/rootfs/nvidia":"/usr/local/nvidia"}
  // Maps from host directory where device library or tools
  // are installed to user pod directory where the library or
  // tools are expected to be accessed. Kubelet will use this
  // information to bind mount host directory to the user pod
  // directory during allocation.
  map<string, string> mountpaths = 3;
  // E.g., {"LD_PRELOAD":"xxx.so"}. Kubelet will export these
  // env variables in user pod during allocation.
  map<string, string> envariables = 4;
  // E.g., {"Family":"Pascal"} {"ECC":"True"}
  // These fields can be used as node labels for selection
  map<string, string> labels = 5;
}

message ReportResponse {
  bool success = 1;
  // Kubelet fills this field if it encounters any errors
  // during the report process, e.g., device plugin hasn't
  // registered yet (could happen when Kubelet restarts).
  string error = 2;
}
```
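To make the intended call order concrete, here is a minimal Go sketch of how Kubelet might handle the two RPCs: reject a report from a plugin that has not registered, reject a version mismatch, and otherwise count healthy devices as capacity. The `registry` type, its fields, and passing the resource name to `ReportDeviceInfo` explicitly are assumptions for illustration (`ReportRequest` carries no resource name, so a real server would have to associate reports with a registration some other way); this is not generated gRPC code.

```go
package main

import "fmt"

// DeviceState mirrors the states named in the DeviceInfo message.
type DeviceState int

const (
	Unknown DeviceState = iota
	Healthy
	Unhealthy
)

// DeviceInfo is a trimmed-down, hand-written mirror of the protobuf message.
type DeviceInfo struct {
	ID    string
	State DeviceState
}

// registry is hypothetical Kubelet-side bookkeeping for device plugins.
type registry struct {
	version  string          // API version this Kubelet accepts
	plugins  map[string]bool // registered resource names
	capacity map[string]int  // healthy device count per exported resource
}

func newRegistry(version string) *registry {
	return &registry{
		version:  version,
		plugins:  map[string]bool{},
		capacity: map[string]int{},
	}
}

// Register mirrors RegisterRequest/RegisterResponse: it returns the success
// flag and an error string describing a version mismatch.
func (r *registry) Register(version, resourceName string) (bool, string) {
	if version != r.version {
		return false, fmt.Sprintf("unsupported version %q, kubelet requires %q", version, r.version)
	}
	r.plugins[resourceName] = true
	return true, ""
}

// ReportDeviceInfo mirrors ReportRequest/ReportResponse and records how many
// of the reported devices are healthy.
func (r *registry) ReportDeviceInfo(resourceName string, devices []DeviceInfo) (bool, string) {
	if !r.plugins[resourceName] {
		return false, "device plugin has not registered yet"
	}
	healthy := 0
	for _, d := range devices {
		if d.State == Healthy {
			healthy++
		}
	}
	r.capacity["extensions.kubernetes.io/"+resourceName] = healthy
	return true, ""
}

func main() {
	r := newRegistry("v1alpha1") // the version string is a placeholder
	devices := []DeviceInfo{{ID: "GPU-0", State: Healthy}, {ID: "GPU-1", State: Unhealthy}}

	fmt.Println(r.ReportDeviceInfo("nvidia-gpu", devices)) // before registration
	fmt.Println(r.Register("v1alpha1", "nvidia-gpu"))
	fmt.Println(r.ReportDeviceInfo("nvidia-gpu", devices))
	fmt.Println(r.capacity)
}
```

The "not registered yet" branch corresponds to the Kubelet restart case mentioned in the `ReportResponse` comment: the plugin would react to it by registering again.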
## Failure recovery
* Device failure: Device plugin should be able to detect device failure and
report that to Kubelet. Kubelet should then remove the failed device from
available list. If there is any user container using the failed device,
Kubelet may terminate the user container and reschedule it on a good
available device. When a failed device recovers, device plugin will send
Kubelet the updated device state and Kubelet can add the device to the
available device list.
* Kubelet crash: When Kubelet restarts after a crash, it should be able to
recover allocation states from the checkpoints recorded on persistent storage.
The checkpoint records should include allocated device id to pod mapping
information as well as non-allocated device information, so Kubelet can
re-establish precise allocation state. Device plugin should be able to
detect Kubelet failure when it needs to update device information,
and re-register with the new Kubelet.
* Device plugin crash: A device plugin is deployed through DaemonSets.
If a device plugin process fails, Kubelet will detect that and automatically
restart it. After restart, device plugin will reconnect to Kubelet and
report the current device states. Kubelet can compare the reported device
information with its internal device states, and make adjustments if
necessary. One thing we need to pay special attention to is that device plugin
may fail at any time, e.g., during initialization. When the new device plugin
process starts, it needs to be able to recover from incomplete states.
## Roadmap
* Phase 1: device plugin is supported in alpha mode in 1.8 kubernetes release.
Make sure it provides the following functionalities: initialize, discover,
allocate/deallocate, cleanup, basic health check, and can recover from device,
Kubelet or device plugin failures. Make sure the interface is kept simple and
extensible, and the document is clear. With the initial implemented API,
make sure we can use the interface to implement device plugin images for at
least two types of devices: Nvidia GPU and Solarflare NIC. Note the support
for Nvidia GPU will help gpu support to enter beta by providing a general
and extendable api. Test coverage: e2e tests with the developed Nvidia GPU
image and Solarflare image to make sure these devices can be correctly
initialized, allocated, deallocated, and cleaned up. Also should test we
can recover from device failure, Kubelet restarts, and device plugin failure.
* Phase 2: device plugin is supported in beta mode in 1.9 kubernetes release.
At this phase, the primary design and API should be stabilized. We need to
implement authentication mechanism to ensure only trusted device plugin
images can be registered. We can support device specific
allocation/deallocation requests by having device plugin also provide a gRPC
service or Device Plugin can instruct Kubelet to perform device specific
operation hooks during allocation/deallocation procedures.
Hopefully at this time, we may make good progress on supporting more flexible
resource allocation policies in Kubernetes, and with that, we can switch
device plugin from using OIR to using ResourceClass to allow more efficient
HW specific resource allocations, e.g., topology aware resource allocations,
NUMA aware resource allocations etc.
* Phase 3: device plugin is supported in GA mode in 1.10 kubernetes release.
Device plugin should have clear error handling and problem report that
allows easy debuggability and monitoring on its exported devices.
We should have clear documentation on how to develop a device plugin and
how to interact with device plugin. The framework needs to be stable and
demonstrate good user experiences through the support on multiple types
of devices, such as GPU, Infiniband, high-performance NIC, etc.
## Open Questions
* The proposal assumes we can omit device specific allocation/deallocation
operations in the alpha release and support this feature in later releases.
If people are concerned that such omission would impact the usability of
alpha release, we will need to come up with a solution that would either
require two-way gRPC communication between Kubelet and Device Plugin or
Device Plugin can instruct Kubelet to perform device specific operation hooks
during allocation/deallocation procedures.
# Installation
The installation process should be straightforward to the user, transparent
and similar to other regular Kubernetes actions.
@ -366,6 +573,7 @@ as `kubeadm` they would use the examples that we would provide at:
`https://github.com/Kubernetes/Kubernetes/tree/master/examples/device-plugin.yaml`
YAML example:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
@ -388,48 +596,6 @@ spec:
path: /var/run/Kubernetes
```
## API Changes
### Device
When discovering the devices, Kubelet will be in charge of advertising those
resources to the API server.
We will advertise each device returned by the Device Plugin in a new structure
called `Device`.
It is defined as follows:
```golang
type Device struct {
Kind string
Vendor string
Name string
Health DeviceHealthStatus
Properties map[string]string
}
```
Because the current API (Capacity) can not be extended to support Device,
we will need to create two new attributes in the NodeStatus structure:
* `DevCapacity`: Describing the device capacity of the node
* `DevAvailable`: Describing the available devices
```golang
type NodeStatus struct {
DevCapacity []Device
DevAvailable []Device
}
```
We also introduce the `Allocated` field in the pod's status so that user
can know what devices were assigned to the pod. It could also be useful in
the case of monitoring.
```golang
type ContainerStatus struct {
Devices []Device
}
```
# Versioning
Currently there is only one part (CRI) of Kubernetes which is based on
@ -469,3 +635,12 @@ Negotiation would take place in the registration:
contacts the Device Plugin
4. If the Device Plugin supports the version sent by Kubelet it can and should
answer the different calls made by Kubelet
## References
* [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
* [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
* [Kubernetes Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
* [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)