Device plugin proposal patch by Jiaying
parent
9d7a245a23
commit
4cbc77e491
@ -1,99 +1,77 @@
|
||||||
# Device Manager Proposal
|
Device Manager Proposal
|
||||||
|
===============
|
||||||
|
|
||||||
1. [Abstract](#abstract)
|
<!-- BEGIN MUNGE: GENERATED_TOC -->
|
||||||
2. [Motivation](#motivation)
|
|
||||||
3. [Use Cases](#use-cases)
|
- [Motivation](#motivation)
|
||||||
4. [Objectives](#objectives)
|
- [Use Cases](#use-cases)
|
||||||
5. [Non Objectives](#non-objectives)
|
- [Objectives](#objectives)
|
||||||
6. [Stories](#stories)
|
- [Non Objectives](#non-objectives)
|
||||||
* [Vendor story](#vendor-story)
|
- [Proposed Implementation 1](#proposed-implementation-1)
|
||||||
* [User story](#user-story)
|
- [Vendor story](#vendor-story)
|
||||||
8. [Device Plugin](#device-plugin)
|
- [End User story](#end-user-story)
|
||||||
* [Protocol Overview](#protocol-overview)
|
- [Device Plugin](#device-plugin)
|
||||||
* [Protobuf specification](#protobuf-specification)
|
- [Introduction](#introduction)
|
||||||
* [Installation](#installation)
|
- [Registration](#registration)
|
||||||
* [API Changes](#api-changes)
|
- [Unix Socket](#unix-socket)
|
||||||
* [Versioning](#versioning)
|
- [Protocol Overview](#protocol-overview)
|
||||||
|
- [Protobuf specification](#protobuf-specification)
|
||||||
|
- [Proposed Implementation 2](#proposed-implementation-2)
|
||||||
|
- [Device Plugin Lifecycle](#device-plugin-lifecycle)
|
||||||
|
- [Protobuf API](#protobuf-api)
|
||||||
|
- [Failure recovery](#failure-recovery)
|
||||||
|
- [Roadmap](#roadmap)
|
||||||
|
- [Open Questions](#open-questions-1)
|
||||||
|
- [Installation](#installation)
|
||||||
|
- [Versioning](#versioning)
|
||||||
|
- [References](#references)
|
||||||
|
|
||||||
|
<!-- END MUNGE: GENERATED_TOC -->
|
||||||
|
|
||||||
_Authors:_
|
_Authors:_
|
||||||
|
|
||||||
* @RenaudWasTaken - Renaud Gaubert <rgaubert@NVIDIA.com>
|
* @RenaudWasTaken - Renaud Gaubert <rgaubert@NVIDIA.com>
|
||||||
|
|
||||||
## Abstract
|
# Motivation
|
||||||
|
|
||||||
This document describes a vendor-independent solution to:
|
|
||||||
* Discovering and representing external devices
|
|
||||||
* Making these devices available to the container and cleaning them up
|
|
||||||
afterwards
|
|
||||||
* Health Check of these devices
|
|
||||||
|
|
||||||
Because devices are vendor-dependent and have their own sets of problems
|
|
||||||
and mechanisms, the solution we describe is a plugin mechanism managed by
|
|
||||||
Kubelet.
|
|
||||||
|
|
||||||
At their core, device plugins are simple gRPC servers that may run in a
|
|
||||||
container deployed through the pod mechanism.
|
|
||||||
|
|
||||||
These servers implement the gRPC interface defined later in this design
|
|
||||||
document and once the device plugin makes itself known to kubelet, kubelet
|
|
||||||
will interact with the device through three simple functions:
|
|
||||||
1. A `Discover` function for the kubelet to Discover the devices and
|
|
||||||
their properties.
|
|
||||||
2. An `Allocate` and `Deallocate` function which are called respectively
|
|
||||||
before container creation and after container deletion with the
|
|
||||||
devices to allocate and deallocate.
|
|
||||||
3. A `Monitor` function to notify Kubelet whenever a device becomes
|
|
||||||
unhealthy.
|
|
||||||
|
|
||||||
The goal is for a user to be able to enable vendor devices (e.g., GPUs) through
|
|
||||||
the simple following steps:
|
|
||||||
* `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
|
|
||||||
* When launching `kubectl describe nodes`, the devices appear in the node spec
|
|
||||||
* In the long term users will be able to select them through Resource Class
|
|
||||||
|
|
||||||
We expect the plugins to be deployed across the clusters through DaemonSets.
|
|
||||||
The targeted devices are GPUs, NICs, FPGAs, InfiniBand, Storage devices, ....
|
|
||||||
|
|
||||||
|
|
||||||
## Motivation
|
|
||||||
|
|
||||||
Kubernetes currently supports discovery of CPU and Memory primarily to a
|
Kubernetes currently supports discovery of CPU and Memory primarily to a
|
||||||
minimal extent. Very few devices are handled natively by Kubelet.
|
minimal extent. Very few devices are handled natively by Kubelet.
|
||||||
|
|
||||||
It is not a sustainable solution to expect every vendor to add their vendor
|
It is not a sustainable solution to expect every vendor to add their vendor
|
||||||
specific code inside Kubernetes. This approach does not scale and is not
|
specific code inside Kubernetes to make their devices usable.
|
||||||
portable.
|
Instead, we want a solution for vendors to be able to advertise their resources
|
||||||
|
to Kubelet and monitor them without writing custom Kubernetes code.
|
||||||
|
We also want to provide a consistent and portable solution for users to
|
||||||
|
consume hardware devices across k8s clusters.
|
||||||
|
|
||||||
We want a solution for those vendors to be able to advertise their resources
|
This document describes a vendor-independent solution to:
|
||||||
to kubelet and monitor them.
|
* Discovering and representing external devices
|
||||||
We also want a way for the user to specify which resource their jobs will use
|
* Making these devices available to the containers using these devices and
|
||||||
and what constraints are associated to these resources.
|
cleaning them up afterwards
|
||||||
|
* Monitoring these devices
|
||||||
|
|
||||||
In order to solve this problem it is obvious that we need a plugin system in
|
Because devices are vendor-dependent and have their own sets of problems
|
||||||
order to have vendors advertise and monitor their resources on behalf
|
and mechanisms, the solution we describe is a plugin mechanism that may run
|
||||||
of Kubelet.
|
in a container deployed through the DaemonSet mechanism.
|
||||||
|
The targeted devices include GPUs, High-performance NICs, FPGAs, InfiniBand,
|
||||||
|
Storage devices, and other similar computing resources that require vendor
|
||||||
|
specific initialization and setup.
|
||||||
|
|
||||||
Additionally, we introduce the concept of Device to be able to select
|
The goal is for a user to be able to enable vendor devices (e.g., GPUs) through
|
||||||
resources with constraints in a pod spec.
|
the following simple steps:
|
||||||
|
* `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
|
||||||
|
* When launching `kubectl describe nodes`, the devices appear in the node spec
|
||||||
|
* In the long term users will be able to select them through Resource Class
|
||||||
|
|
||||||
_GPU Integration Example:_
|
# Use Cases
|
||||||
* [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
|
|
||||||
* [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
|
|
||||||
|
|
||||||
_Kubernetes Meeting Notes On This:_
|
* I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
|
||||||
* [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
|
in my pod.
|
||||||
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
|
* I should be able to use that device without writing custom Kubernetes code.
|
||||||
* [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
|
* I want a consistent and portable solution to consume hardware devices
|
||||||
|
across k8s clusters.
|
||||||
|
|
||||||
## Use Cases
|
# Objectives
|
||||||
|
|
||||||
* I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
|
|
||||||
in my pod.
|
|
||||||
* I should be able to use that device without writing custom Kubernetes code.
|
|
||||||
* I want a consistent and portable solution to consume hardware devices
|
|
||||||
across k8s clusters.
|
|
||||||
|
|
||||||
## Objectives
|
|
||||||
|
|
||||||
1. Add support for vendor specific Devices in kubelet:
|
1. Add support for vendor specific Devices in kubelet:
|
||||||
* Through a pluggable mechanism.
|
* Through a pluggable mechanism.
|
||||||
|
|
@ -103,16 +81,18 @@ _Kubernetes Meeting Notes On This:_
|
||||||
2. Define a deployment mechanism for this new API.
|
2. Define a deployment mechanism for this new API.
|
||||||
3. Define a versioning mechanism for this new API.
|
3. Define a versioning mechanism for this new API.
|
||||||
|
|
||||||
## Non Objectives
|
# Non Objectives
|
||||||
1. Advanced scheduling and resource selection (solved through [#782](https://github.com/Kubernetes/community/pull/782)).
|
|
||||||
|
1. Advanced scheduling and resource selection (solved through
|
||||||
|
[#782](https://github.com/Kubernetes/community/pull/782)).
|
||||||
We will only try to give basic selection primitives to the devices
|
We will only try to give basic selection primitives to the devices
|
||||||
2. Metrics: this should be the job of cadvisor and should probably either be
|
2. Metrics: this should be the job of cadvisor and should probably either be
|
||||||
addressed there (cadvisor) or if people feel there is a case to be made
|
addressed there (cadvisor) or if people feel there is a case to be made
|
||||||
for it being addressed in the Device Plugin, in a follow up proposal.
|
for it being addressed in the Device Plugin, in a follow up proposal.
|
||||||
|
|
||||||
## Stories
|
# Proposed Implementation 1
|
||||||
|
|
||||||
### Vendor story
|
## Vendor story
|
||||||
|
|
||||||
Kubernetes provides to vendors a mechanism called device plugins to:
|
Kubernetes provides to vendors a mechanism called device plugins to:
|
||||||
* advertise devices.
|
* advertise devices.
|
||||||
|
|
@ -144,7 +124,7 @@ onwn gRPC server.
|
||||||
Only then will kubelet start interacting with the vendor's device plugin
|
Only then will kubelet start interacting with the vendor's device plugin
|
||||||
through the gRPC apis.
|
through the gRPC apis.
|
||||||
|
|
||||||
### End User story
|
## End User story
|
||||||
|
|
||||||
When setting up the cluster the admin knows what kind of devices are present
|
When setting up the cluster the admin knows what kind of devices are present
|
||||||
on the different machines and therefore can select what devices they want to
|
on the different machines and therefore can select what devices they want to
|
||||||
|
|
@ -182,6 +162,7 @@ He might in the future be in charge of selecting the device.
|
||||||
## Device Plugin
|
## Device Plugin
|
||||||
|
|
||||||
### Introduction
|
### Introduction
|
||||||
|
|
||||||
The device plugin is structured in 5 parts:
|
The device plugin is structured in 5 parts:
|
||||||
1. Registration: The device plugin advertises its presence to Kubelet
|
1. Registration: The device plugin advertises its presence to Kubelet
|
||||||
2. Discovery: Kubelet calls the device plugin to list its devices
|
2. Discovery: Kubelet calls the device plugin to list its devices
|
||||||
|
|
@ -189,7 +170,7 @@ The device plugin is structured in 5 parts:
|
||||||
devices advertised by the device plugin, Kubelet calls the device plugin's
|
devices advertised by the device plugin, Kubelet calls the device plugin's
|
||||||
`Allocate` and `Deallocate` functions.
|
`Allocate` and `Deallocate` functions.
|
||||||
4. Cleanup: Kubelet terminates the communication through a "Stop"
|
4. Cleanup: Kubelet terminates the communication through a "Stop"
|
||||||
4. Heartbeat: The device plugin polls Kubelet to know if it's still alive
|
5. Heartbeat: The device plugin polls Kubelet to know if it's still alive
|
||||||
and if it has to re-issue a Register request
|
and if it has to re-issue a Register request
|
||||||
|
|
||||||
### Registration
|
### Registration
|
||||||
|
|
@ -247,7 +228,7 @@ The device plugin is also expected to periodically call the `Heartbeat` function
|
||||||
exposed by Kubelet and issue a `Registration` request when it either can't reach
|
exposed by Kubelet and issue a `Registration` request when it either can't reach
|
||||||
Kubelet or Kubelet answers with a `KO` response.
|
Kubelet or Kubelet answers with a `KO` response.
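
The heartbeat rule above can be sketched in Go as follows. This is only an illustration under stated assumptions: `HeartbeatFunc`, `NeedsRegistration`, and `Beat` are hypothetical names, the gRPC call is replaced by a plain function value, and any status other than `KO` is treated as healthy because the proposal only names the `KO` answer.

```go
package main

import (
	"errors"
	"fmt"
)

// HeartbeatFunc is a hypothetical stand-in for the gRPC Heartbeat call
// exposed by Kubelet. It returns Kubelet's status string, or an error
// when Kubelet cannot be reached.
type HeartbeatFunc func() (string, error)

// NeedsRegistration reports whether the plugin must re-issue a Register
// request: either Kubelet was unreachable or it answered "KO".
func NeedsRegistration(hb HeartbeatFunc) bool {
	status, err := hb()
	return err != nil || status == "KO"
}

// Beat performs one heartbeat and calls register only when required.
// It returns true when a registration was issued and succeeded.
func Beat(hb HeartbeatFunc, register func() error) (bool, error) {
	if !NeedsRegistration(hb) {
		return false, nil
	}
	if err := register(); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	// Simulate an unreachable Kubelet followed by a successful registration.
	unreachable := func() (string, error) { return "", errors.New("connection refused") }
	register := func() error { return nil }

	reregistered, err := Beat(unreachable, register)
	fmt.Println("re-registered:", reregistered, "error:", err)
}
```

A real device plugin would presumably run `Beat` on a timer and back off between failed registration attempts; neither is specified by this proposal.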
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
|
|
||||||
### Protobuf specification
|
### Protobuf specification
|
||||||
|
|
@ -343,7 +324,233 @@ message DeviceHealth {
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
## Installation
|
# Proposed Implementation 2
|
||||||
|
|
||||||
|
The main strategy of this proposed implementation is that we want to start with
|
||||||
|
something simple that can show benefits on our immediate use cases, yet the
|
||||||
|
API design should be extendable to support future requirements.
|
||||||
|
Here are the main motivations for this alternative proposed implementation:
|
||||||
|
|
||||||
|
* Discovery phase: can we eliminate this gRPC procedure? It seems more
|
||||||
|
natural for device plugin to send Kubelet the discovered device information
|
||||||
|
right after device initialization and the registration gRPC procedure.
|
||||||
|
* The current implementation uses gRPC to communicate between Kubelet and
|
||||||
|
device plugin. Both Kubelet and device plugin need to start a gRPC server
|
||||||
|
for two-way communication. This seems a bit complicated. Can we have
|
||||||
|
device plugin send enough information to Kubelet so that we only need
|
||||||
|
Kubelet to start gRPC server and device plugin is kept as gRPC client?
|
||||||
|
The main concern with one-way gRPC communication is that we can NOT
|
||||||
|
support device specific operations, like reset device, during
|
||||||
|
allocation/deallocation. Depending on how long we expect device specific
|
||||||
|
operations to take, we can support this feature later by
|
||||||
|
having device plugin also provide a gRPC service or Device Plugin
|
||||||
|
can instruct Kubelet to perform device specific operation hooks.
|
||||||
|
* Do we need checkpointing in the initial prototype implementation?
|
||||||
|
Even in alpha release, we may still want to be able to recover from
|
||||||
|
various failure scenarios. Otherwise, it would affect user experience.
|
||||||
|
Currently, it seems the only information we need to record somewhere
|
||||||
|
is what device is allocated to what pod/container. There have been
|
||||||
|
discussions on different ways and places to record this information.
|
||||||
|
The approach taken by the current implementation pushes this information
|
||||||
|
to ApiServer by extending NodeStatus interface between Kubelet and ApiServer.
|
||||||
|
The major concern with this approach is that it introduces an API extension
|
||||||
|
that departs from the current model (currently, Node information recorded at ApiServer
|
||||||
|
only contains resource capacity information. Resource allocation information
|
||||||
|
is kept at Node). The second approach is for Kubelet to checkpoint this
|
||||||
|
information. This seems to align with the current Kubernetes model that
|
||||||
|
Kubelet is the component to implement allocation/deallocation functionalities.
|
||||||
|
The information we want to checkpoint, i.e., what device is allocated to what
|
||||||
|
pod/container, also seems generic enough to be implemented at Kubelet.
|
||||||
|
It may also allow other use cases outside device plugin, e.g., CPU pinning.
|
||||||
|
The third approach is to implement this in device plugin. This way,
|
||||||
|
device plugin can also record any state information useful to its own
|
||||||
|
failure recovery in checkpoints. One concern with this approach is that it
|
||||||
|
may put more burden on vendors implementing their device plugin images.
|
||||||
|
Surprises might happen in production if things were not implemented correctly.
|
||||||
|
It also departs from the current model, since today Kubelet is the place
|
||||||
|
where allocation/deallocation happens for other types of resources.
|
||||||
|
* Heartbeat: do we need this to make sure connections can be re-established
|
||||||
|
between kubelet and device plugin? Can we reuse the keepalive feature of gRPC?
|
||||||
|
Or if Kubelet checkpoints device allocation state information, device plugin
|
||||||
|
may only need to detect Kubelet failure when it needs to update device
|
||||||
|
information. Or can device plugin send periodic device state updates
|
||||||
|
(this may be needed anyway if we want to collect device usage stats)
|
||||||
|
and use that to detect Kubelet failure or device plugin failure?
|
||||||
|
|
||||||
|
## Device Plugin Lifecycle
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
1. A user or cluster admin pushes a vendor-specific device plugin DaemonSet.
|
||||||
|
The DaemonSet YAML config includes mountPaths to the host directories
|
||||||
|
where the device driver, user-space libraries, and tools will be installed.
|
||||||
|
2. After the device plugin container is brought up, it detects the specific
|
||||||
|
types of HW devices. If such devices exist, it initializes these devices
|
||||||
|
and sets up the environments to access these devices (e.g., install
|
||||||
|
device drivers, user-space libraries, and tools).
|
||||||
|
3. After initialization, device plugin queries HW device states through the
|
||||||
|
installed device monitoring tools or other device interfaces. Then device
|
||||||
|
plugin connects to the Kubelet device plugin gRPC server and sends it the
|
||||||
|
obtained list of HW device information. In the initial prototype, the
|
||||||
|
device resource exported by a device plugin can be implemented as an
|
||||||
|
[extended OIR](https://github.com/kubernetes/kubernetes/pull/48922)
|
||||||
|
with the special prefix “extensions.kubernetes.io/” plus a device resource_name
|
||||||
|
that uniquely identifies a device plugin on a node.
|
||||||
|
Kubelet can use existing API to add this resource to API server so that the
|
||||||
|
device resource is available for scheduling.
|
||||||
|
4. Device plugin runs in a loop to continuously query HW device states. If it
|
||||||
|
detects any changes, it sends the Kubelet device plugin gRPC server the new
|
||||||
|
list of HW device information. Kubelet can use this information to update its
|
||||||
|
device capacity states and, if necessary, re-allocate a new device to a user
|
||||||
|
container with unhealthy allocated devices.
|
||||||
|
5. When Kubelet receives an allocation request for a HW device advertised
|
||||||
|
by a device plugin (i.e., resource with “extensions.kubernetes.io/” prefix
|
||||||
|
plus device resource_name), it updates its internal allocation state,
|
||||||
|
issues certain calls to CRI to bind mount the host directories where
|
||||||
|
user-space libraries and tools are installed to the device-specific
|
||||||
|
default directories in user Pod or set up certain environment variables,
|
||||||
|
and checkpoints the container-to-device allocation information to persistent
|
||||||
|
storage.
|
||||||
|
6. When the user container accessing the device finishes, Kubelet updates its
|
||||||
|
internal state to deallocate the device, and updates its checkpoint state
|
||||||
|
in persistent storage.
|
||||||
|
7. When the device plugin DaemonSet is removed, clean up device state (e.g., uninstall
|
||||||
|
device driver, remove user-space libraries and tools). This step can be
|
||||||
|
specified as a preStop
|
||||||
|
[container lifecycle step](https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/).
|
||||||
|
Note that one implication of this approach is that the device plugin upgrade process
|
||||||
|
will be disruptive. More thought is needed if we want to
|
||||||
|
support a non-disruptive upgrade process.
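
Steps 3 and 4 of the lifecycle above can be sketched in Go as follows. This is a minimal illustration, not the proposed implementation: `DeviceInfo` here is a hand-written, simplified stand-in for the message of the same name, and `ResourceName`, `Changed`, and `MonitorOnce` are hypothetical helpers. The query and report steps are passed in as plain functions so the sketch does not depend on any device tooling or gRPC client.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// resourcePrefix is the prefix the proposal names for plugin-exported resources.
const resourcePrefix = "extensions.kubernetes.io/"

// DeviceInfo is a simplified, hand-written stand-in for the DeviceInfo
// message (hypothetical; not generated gRPC code).
type DeviceInfo struct {
	ID    string
	State string // "UNKNOWN", "HEALTHY" or "UNHEALTHY"
}

// ResourceName builds the resource name under which a plugin's devices
// are advertised, e.g. "extensions.kubernetes.io/nvidia-gpu".
func ResourceName(pluginName string) string {
	return resourcePrefix + pluginName
}

// sortedByID returns a copy of the devices ordered by ID.
func sortedByID(in []DeviceInfo) []DeviceInfo {
	out := append([]DeviceInfo{}, in...)
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

// Changed reports whether two device snapshots differ, ignoring order.
func Changed(prev, cur []DeviceInfo) bool {
	return !reflect.DeepEqual(sortedByID(prev), sortedByID(cur))
}

// MonitorOnce runs one iteration of the monitoring loop: it queries the
// current device states and reports them only if something changed.
// It returns the snapshot to compare against on the next iteration.
func MonitorOnce(prev []DeviceInfo, query func() []DeviceInfo, report func([]DeviceInfo)) []DeviceInfo {
	cur := query()
	if Changed(prev, cur) {
		report(cur)
	}
	return cur
}

func main() {
	fmt.Println("advertising", ResourceName("nvidia-gpu"))

	prev := []DeviceInfo{{ID: "gpu-0", State: "HEALTHY"}}
	query := func() []DeviceInfo { return []DeviceInfo{{ID: "gpu-0", State: "UNHEALTHY"}} }
	report := func(d []DeviceInfo) { fmt.Println("reporting", len(d), "device(s) to Kubelet") }
	MonitorOnce(prev, query, report)
}
```

Comparing whole snapshots keeps the plugin stateless apart from the last list it sent, at the cost of resending the full list on every change.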
|
||||||
|
|
||||||
|
## Protobuf API
|
||||||
|
|
||||||
|
```protobuf
|
||||||
|
|
||||||
|
service PluginResource {
|
||||||
|
rpc Register(RegisterRequest) returns (RegisterResponse) {}
|
||||||
|
rpc ReportDeviceInfo(ReportRequest) returns (ReportResponse) {}
|
||||||
|
}
|
||||||
|
|
||||||
|
message RegisterRequest {
|
||||||
|
// Version of the API the Device Plugin was built against
|
||||||
|
string version = 1;
|
||||||
|
// E.g., "nvidia-gpu". Used to construct OIR:
|
||||||
|
// “extensions.kubernetes.io/resourcename”.
|
||||||
|
string resourcename = 2;
|
||||||
|
}
|
||||||
|
|
||||||
|
message RegisterResponse {
|
||||||
|
bool success = 1;
|
||||||
|
// Kubelet fills this field with details if it encounters any errors
|
||||||
|
// during the registration process, e.g., for version mismatch, what
|
||||||
|
// is the required version and minimum supported version by kubelet.
|
||||||
|
string error = 2;
|
||||||
|
}
|
||||||
|
|
||||||
|
message ReportRequest {
|
||||||
|
repeated DeviceInfo devices = 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
message DeviceInfo {
|
||||||
|
// E.g., "GPU-fef8089b-4820-abfc-e83e-94318197576e".
|
||||||
|
// Needs to be unique per device plugin.
|
||||||
|
string Id = 1;
|
||||||
|
// E.g., UNKNOWN, HEALTHY, UNHEALTHY.
|
||||||
|
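// Declared as a nested enum so that the field below has a valid type.
enum State {
  UNKNOWN = 0;
  HEALTHY = 1;
  UNHEALTHY = 2;
}
State state = 2;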
|
||||||
|
// E.g., {"/rootfs/nvidia":"/usr/local/nvidia"}
|
||||||
|
// Maps from host directory where device library or tools
|
||||||
|
// are installed to user pod directory where the library or
|
||||||
|
// tools are expected to be accessed. Kubelet will use this
|
||||||
|
// information to bind mount host directory to the user pod
|
||||||
|
// directory during allocation.
|
||||||
|
map<string, string> mountpaths = 3;
|
||||||
|
// E.g., {"LD_PRELOAD":"xxx.so"}. Kubelet will export these
|
||||||
|
// env variables in user pod during allocation.
|
||||||
|
map<string, string> envariables = 4;
|
||||||
|
// E.g., {"Family":"Pascal"} {"ECC":"True"}
|
||||||
|
// These fields can be used as node labels for selection
|
||||||
|
map<string, string> labels = 5;
|
||||||
|
}
|
||||||
|
|
||||||
|
message ReportResponse {
|
||||||
|
bool success = 1;
|
||||||
|
// Kubelet fills this field if it encounters any errors
|
||||||
|
// during the report process, e.g., device plugin hasn’t
|
||||||
|
// registered yet (could happen when Kubelet restarts).
|
||||||
|
string error = 2;
|
||||||
|
}
|
||||||
|
```
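
To make the intended call order concrete, here is a minimal in-memory Go sketch of how Kubelet might handle `Register` and `ReportDeviceInfo`. Several things are assumptions made for illustration: the types are hand-written stand-ins rather than generated gRPC code, `supportedVersion` is a made-up version string, a single `Response` type replaces both response messages, and the reporting plugin is identified by an explicit `resourceName` argument because `ReportRequest` carries no such field.

```go
package main

import "fmt"

// supportedVersion is a hypothetical API version accepted by this sketch.
const supportedVersion = "v1alpha1"

// RegisterRequest mirrors the fields of the RegisterRequest message.
type RegisterRequest struct {
	Version      string
	ResourceName string
}

// Response stands in for both RegisterResponse and ReportResponse.
type Response struct {
	Success bool
	Error   string
}

// DeviceInfo is a simplified stand-in for the DeviceInfo message.
type DeviceInfo struct {
	ID    string
	State string // "UNKNOWN", "HEALTHY" or "UNHEALTHY"
}

// Kubelet holds the last reported devices per registered resource name.
type Kubelet struct {
	devices map[string][]DeviceInfo
}

func NewKubelet() *Kubelet {
	return &Kubelet{devices: make(map[string][]DeviceInfo)}
}

// Register accepts a plugin if its API version matches and records it.
func (k *Kubelet) Register(req RegisterRequest) Response {
	if req.Version != supportedVersion {
		return Response{Error: fmt.Sprintf("version %q not supported, required version is %q",
			req.Version, supportedVersion)}
	}
	if req.ResourceName == "" {
		return Response{Error: "resourcename must not be empty"}
	}
	if _, ok := k.devices[req.ResourceName]; !ok {
		k.devices[req.ResourceName] = nil
	}
	return Response{Success: true}
}

// ReportDeviceInfo replaces the device list of a registered plugin.
// Reports from unknown plugins are rejected, which is what a plugin
// would observe after a Kubelet restart.
func (k *Kubelet) ReportDeviceInfo(resourceName string, devices []DeviceInfo) Response {
	if _, ok := k.devices[resourceName]; !ok {
		return Response{Error: "device plugin not registered: " + resourceName}
	}
	k.devices[resourceName] = devices
	return Response{Success: true}
}

// Available counts the healthy devices last reported for a resource.
func (k *Kubelet) Available(resourceName string) int {
	n := 0
	for _, d := range k.devices[resourceName] {
		if d.State == "HEALTHY" {
			n++
		}
	}
	return n
}

func main() {
	k := NewKubelet()
	fmt.Println(k.ReportDeviceInfo("nvidia-gpu", nil).Error)
	fmt.Println(k.Register(RegisterRequest{Version: supportedVersion, ResourceName: "nvidia-gpu"}).Success)
}
```

A plugin that receives the "not registered" error can treat it as the signal to call `Register` again before resending its device list.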
|
||||||
|
|
||||||
|
## Failure recovery
|
||||||
|
|
||||||
|
* Device failure: Device plugin should be able to detect device failure and
|
||||||
|
report that to Kubelet. Kubelet should then remove the failed device from
|
||||||
|
available list. If there is any user container using the failed device,
|
||||||
|
Kubelet may terminate the user container and reschedule it on a good
|
||||||
|
available device. When a failed device recovers, device plugin will send
|
||||||
|
Kubelet the updated device state and Kubelet can add the device to the
|
||||||
|
available device list.
|
||||||
|
* Kubelet crash: When Kubelet restarts after a crash, it should be able to
|
||||||
|
recover allocation states from the checkpoints recorded on persistent storage.
|
||||||
|
The checkpoint records should include allocated device id to pod mapping
|
||||||
|
information as well as non-allocated device information, so Kubelet can
|
||||||
|
re-establish precise allocation state. Device plugin should be able to
|
||||||
|
detect Kubelet failure when it needs to update device information,
|
||||||
|
and re-register with the new Kubelet.
|
||||||
|
* Device plugin crash: A device plugin is deployed through DaemonSets.
|
||||||
|
If a device plugin process fails, Kubelet will detect that and automatically
|
||||||
|
restart it. After restart, device plugin will reconnect to Kubelet and
|
||||||
|
report the current device states. Kubelet can compare the reported device
|
||||||
|
information with its internal device states, and make adjustments if
|
||||||
|
necessary. One thing that needs special attention is that a device plugin
|
||||||
|
may fail at any time, e.g., during initialization. When the new device plugin
|
||||||
|
process starts, it needs to be able to recover from incomplete states.
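
The Kubelet crash and device plugin crash cases can be sketched together in Go. This is an illustration under stated assumptions: the `Checkpoint` layout, the JSON encoding, and the `Save`, `Restore`, and `Reconcile` helpers are all hypothetical, and writing the bytes to persistent storage is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Checkpoint is a hypothetical on-disk record of allocation state:
// allocated device ID to pod mapping plus the non-allocated devices.
type Checkpoint struct {
	Allocated map[string]string `json:"allocated"` // device ID -> pod
	Free      []string          `json:"free"`
}

// Save encodes a checkpoint; the caller would write the bytes to disk.
func Save(c Checkpoint) ([]byte, error) {
	return json.Marshal(c)
}

// Restore decodes a checkpoint previously produced by Save.
func Restore(data []byte) (Checkpoint, error) {
	var c Checkpoint
	err := json.Unmarshal(data, &c)
	return c, err
}

// Reconcile compares a restored checkpoint with the healthy device IDs
// reported by the plugin after a restart. It returns the adjusted
// checkpoint and the pods whose allocated devices are no longer healthy.
func Reconcile(c Checkpoint, healthy []string) (Checkpoint, []string) {
	isHealthy := make(map[string]bool, len(healthy))
	for _, id := range healthy {
		isHealthy[id] = true
	}

	next := Checkpoint{Allocated: map[string]string{}, Free: []string{}}
	affected := map[string]bool{}
	for id, pod := range c.Allocated {
		if isHealthy[id] {
			next.Allocated[id] = pod
		} else {
			affected[pod] = true
		}
	}
	for _, id := range healthy {
		if _, used := next.Allocated[id]; !used {
			next.Free = append(next.Free, id)
		}
	}
	sort.Strings(next.Free)

	pods := make([]string, 0, len(affected))
	for pod := range affected {
		pods = append(pods, pod)
	}
	sort.Strings(pods)
	return next, pods
}

func main() {
	cp := Checkpoint{Allocated: map[string]string{"gpu-0": "pod-a"}, Free: []string{"gpu-1"}}
	data, err := Save(cp)
	if err != nil {
		panic(err)
	}
	restored, err := Restore(data)
	if err != nil {
		panic(err)
	}
	next, pods := Reconcile(restored, []string{"gpu-1"})
	fmt.Println("free devices:", next.Free, "affected pods:", pods)
}
```

What Kubelet does with the affected pods (terminate, reschedule, or wait for the device to recover) is a policy decision that this sketch leaves open.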
|
||||||
|
|
||||||
|
## Roadmap
|
||||||
|
|
||||||
|
* Phase 1: device plugin is supported in alpha mode in the Kubernetes 1.8 release.
|
||||||
|
Make sure it provides the following functionalities: initialize, discover,
|
||||||
|
allocate/deallocate, cleanup, basic health check, and can recover from device,
|
||||||
|
Kubelet, or device plugin failures. Make sure the interface is kept simple and
|
||||||
|
extensible, and the document is clear. With the initial implemented API,
|
||||||
|
make sure we can use the interface to implement device plugin images for at
|
||||||
|
least two types of devices: Nvidia GPU and Solarflare NIC. Note the support
|
||||||
|
for Nvidia GPU will help GPU support enter beta by providing a general
|
||||||
|
and extensible API. Test coverage: e2e tests with the developed Nvidia GPU
|
||||||
|
image and Solarflare image to make sure these devices can be correctly
|
||||||
|
initialized, allocated, deallocated, and cleaned up. We should also test that we
|
||||||
|
can recover from device failure, Kubelet restarts, and device plugin failure.
|
||||||
|
* Phase 2: device plugin is supported in beta mode in the Kubernetes 1.9 release.
|
||||||
|
At this phase, the primary design and API should be stabilized. We need to
|
||||||
|
implement an authentication mechanism to ensure only trusted device plugin
|
||||||
|
images can be registered. We can support device specific
|
||||||
|
allocation/deallocation requests by having device plugin also provide a gRPC
|
||||||
|
service or Device Plugin can instruct Kubelet to perform device specific
|
||||||
|
operation hooks during allocation/deallocation procedures.
|
||||||
|
Hopefully by this time we will have made good progress on supporting more flexible
|
||||||
|
resource allocation policies in Kubernetes, and with that, we can switch
|
||||||
|
device plugin from using OIR to using ResourceClass to allow more efficient
|
||||||
|
HW specific resource allocations, e.g., topology aware resource allocations,
|
||||||
|
NUMA aware resource allocations etc.
|
||||||
|
* Phase 3: device plugin is supported in GA mode in the Kubernetes 1.10 release.
|
||||||
|
Device plugin should have clear error handling and problem reporting that
|
||||||
|
allows easy debugging and monitoring of its exported devices.
|
||||||
|
We should have clear documentation on how to develop a device plugin and
|
||||||
|
how to interact with a device plugin. The framework needs to be stable and
|
||||||
|
demonstrate a good user experience through support for multiple types
|
||||||
|
of devices, such as GPUs, InfiniBand, and high-performance NICs.
|
||||||
|
|
||||||
|
## Open Questions
|
||||||
|
|
||||||
|
* The proposal assumes we can omit device specific allocation/deallocation
|
||||||
|
operations in the alpha release and support this feature in later releases.
|
||||||
|
If people are concerned that such omission would impact the usability of
|
||||||
|
alpha release, we will need to come up with a solution that would either
|
||||||
|
require two-way gRPC communication between Kubelet and Device Plugin or
|
||||||
|
Device Plugin can instruct Kubelet to perform device specific operation hooks
|
||||||
|
during allocation/deallocation procedures.
|
||||||
|
|
||||||
|
# Installation
|
||||||
|
|
||||||
The installation process should be straightforward to the user, transparent
|
The installation process should be straightforward to the user, transparent
|
||||||
and similar to other regular Kubernetes actions.
|
and similar to other regular Kubernetes actions.
|
||||||
|
|
@ -366,6 +573,7 @@ as `kubeadm` they would use the examples that we would provide at:
|
||||||
`https://github.com/Kubernetes/Kubernetes/tree/master/examples/device-plugin.yaml`
|
`https://github.com/Kubernetes/Kubernetes/tree/master/examples/device-plugin.yaml`
|
||||||
|
|
||||||
YAML example:
|
YAML example:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
apiVersion: extensions/v1beta1
|
apiVersion: extensions/v1beta1
|
||||||
kind: DaemonSet
|
kind: DaemonSet
|
||||||
|
|
@ -388,48 +596,6 @@ spec:
|
||||||
path: /var/run/Kubernetes
|
path: /var/run/Kubernetes
|
||||||
```
|
```
|
||||||
|
|
||||||
## API Changes
|
|
||||||
### Device
|
|
||||||
|
|
||||||
When discovering the devices, Kubelet will be in charge of advertising those
|
|
||||||
resources to the API server.
|
|
||||||
|
|
||||||
We will advertise each device returned by the Device Plugin in a new structure
|
|
||||||
called `Device`.
|
|
||||||
It is defined as follows:
|
|
||||||
|
|
||||||
```golang
|
|
||||||
type Device struct {
|
|
||||||
Kind string
|
|
||||||
Vendor string
|
|
||||||
Name string
|
|
||||||
Health DeviceHealthStatus
|
|
||||||
Properties map[string]string
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Because the current API (Capacity) can not be extended to support Device,
|
|
||||||
we will need to create two new attributes in the NodeStatus structure:
|
|
||||||
* `DevCapacity`: Describing the device capacity of the node
|
|
||||||
* `DevAvailable`: Describing the available devices
|
|
||||||
|
|
||||||
```golang
|
|
||||||
type NodeStatus struct {
|
|
||||||
DevCapacity []Device
|
|
||||||
DevAvailable []Device
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
We also introduce the `Allocated` field in the pod's status so that user
|
|
||||||
can know what devices were assigned to the pod. It could also be useful in
|
|
||||||
the case of monitoring
|
|
||||||
|
|
||||||
```golang
|
|
||||||
type ContainerStatus struct {
|
|
||||||
Devices []Device
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
# Versioning
|
# Versioning
|
||||||
|
|
||||||
Currently there is only one part (CRI) of Kubernetes which is based on
|
Currently there is only one part (CRI) of Kubernetes which is based on
|
||||||
|
|
@ -469,3 +635,12 @@ Negotiation would take place in the registration:
|
||||||
contacts the Device Plugin
|
contacts the Device Plugin
|
||||||
4. If the Device Plugin supports the version sent by Kubelet it can and should
|
4. If the Device Plugin supports the version sent by Kubelet it can and should
|
||||||
answer the different calls made by Kubelet
|
answer the different calls made by Kubelet
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
* [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
|
||||||
|
* [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
|
||||||
|
* [Kubernetes Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
|
||||||
|
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
|
||||||
|
* [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
|
||||||
|
|
||||||
|
|
|
||||||