Added Device plugin proposal
parent
c426590350
commit
9d7a245a23
@@ -0,0 +1,471 @@
# Device Manager Proposal

1. [Abstract](#abstract)
2. [Motivation](#motivation)
3. [Use Cases](#use-cases)
4. [Objectives](#objectives)
5. [Non Objectives](#non-objectives)
6. [Stories](#stories)
    * [Vendor story](#vendor-story)
    * [End User story](#end-user-story)
7. [Device Plugin](#device-plugin)
    * [Protocol Overview](#protocol-overview)
    * [Protobuf specification](#protobuf-specification)
    * [Installation](#installation)
    * [API Changes](#api-changes)
    * [Versioning](#versioning)

_Authors:_

* @RenaudWasTaken - Renaud Gaubert <rgaubert@NVIDIA.com>

## Abstract

This document describes a vendor-independent solution to:
* Discover and represent external devices
* Make these devices available to containers and clean them up
  afterwards
* Health-check these devices

Because devices are vendor dependent and have their own sets of problems
and mechanisms, the solution we describe is a plugin mechanism managed by
Kubelet.

At their core, device plugins are simple gRPC servers that may run in a
container deployed through the pod mechanism.

These servers implement the gRPC interface defined later in this design
document. Once the device plugin makes itself known to Kubelet, Kubelet
interacts with the devices through the following functions:
1. A `Discover` function for Kubelet to discover the devices and
   their properties.
2. `Allocate` and `Deallocate` functions, which are called respectively
   before container creation and after container deletion with the
   devices to allocate and deallocate.
3. A `Monitor` function to notify Kubelet whenever a device becomes
   unhealthy.

The goal is for a user to be able to enable vendor devices (e.g., GPUs) through
the following simple steps:
* `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
* When running `kubectl describe nodes`, the devices appear in the node status
* In the long term, users will be able to select them through Resource Class

We expect the plugins to be deployed across clusters through DaemonSets.
The targeted devices include GPUs, NICs, FPGAs, InfiniBand, storage devices, etc.

## Motivation

Kubernetes currently supports discovery primarily of CPU and memory, and
only to a minimal extent. Very few devices are handled natively by Kubelet.

Expecting every vendor to add vendor-specific code inside Kubernetes is not
a sustainable solution. This approach does not scale and is not
portable.

We want a solution that lets vendors advertise their resources
to Kubelet and monitor them.
We also want a way for users to specify which resources their jobs will use
and what constraints are associated with these resources.

To solve this problem, we propose a plugin system that lets vendors
advertise and monitor their resources on behalf
of Kubelet.

Additionally, we introduce the concept of Device to be able to select
resources with constraints in a pod spec.

_GPU Integration Example:_
* [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
* [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)

_Kubernetes Meeting Notes On This:_
* [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
* [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)

## Use Cases

* I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
  in my pod.
* I should be able to use that device without writing custom Kubernetes code.
* I want a consistent and portable solution to consume hardware devices
  across k8s clusters.

## Objectives

1. Add support for vendor-specific devices in Kubelet:
    * Through a pluggable mechanism.
    * Which allows discovery and monitoring of devices.
    * Which allows hooking the runtime to make devices available in containers
      and cleaning them up.
2. Define a deployment mechanism for this new API.
3. Define a versioning mechanism for this new API.

## Non Objectives

1. Advanced scheduling and resource selection (solved through [#782](https://github.com/Kubernetes/community/pull/782)).
   We will only try to give basic selection primitives to the devices.
2. Metrics: this should be the job of cadvisor. It should either be
   addressed there or, if people feel there is a case to be made
   for addressing it in the Device Plugin, in a follow-up proposal.

## Stories

### Vendor story

Kubernetes provides vendors with a mechanism called device plugins to:
* advertise devices.
* monitor devices (currently, perform health checks).
* hook into the runtime to tell Kubelet which steps to
  take in order to make a device available (or to clean it up).

A device plugin is at its core a simple gRPC server, usually running in
a container and deployed across clusters through a DaemonSet.

```protobuf
service DevicePlugin {
	rpc Discover(Empty) returns (stream Device) {}
	rpc Monitor(Empty) returns (stream DeviceHealth) {}

	rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
	rpc Deallocate(DeallocateRequest) returns (Empty) {}
}
```

The gRPC server that the device plugin must implement is expected to
be advertised on a Unix socket in a mounted hostPath (e.g.,
`/var/run/Kubernetes/vendor.sock`).

Finally, to notify Kubelet of the existence of the device plugin,
the vendor's device plugin has to make a request to Kubelet's
own gRPC server.
Only then will Kubelet start interacting with the vendor's device plugin
through the gRPC APIs.

### End User story

When setting up the cluster, the admin knows what kind of devices are present
on the different machines and can therefore select which devices to
enable.

The cluster admin knows that the cluster has NVIDIA GPUs and therefore deploys
the NVIDIA device plugin through:
`kubectl create -f NVIDIA.io/device-plugin.yml`

The device plugin lands on all the nodes of the cluster. If it detects that
there are no GPUs on a node, it terminates. When there are GPUs, it reports them
to Kubelet.
For device plugins reporting non-GPU devices, these are advertised as
OIRs (Opaque Integer Resources) and selected through the same method.

1. A user submits a pod spec requesting X GPUs (or devices).
2. The scheduler filters out the nodes which do not match the resource requests.
3. The pod lands on the node and Kubelet decides which device
   should be assigned to the pod.
4. Kubelet calls `Allocate` on the matching Device Plugins.
5. The user deletes the pod or the pod terminates.
6. Kubelet calls `Deallocate` on the matching Device Plugins.

When receiving a pod which requests devices, Kubelet is in charge of:
* deciding which device to assign to the pod's containers (this will
  change in the future)
* advertising the changes to the node's `Available` list
* advertising the changes to the pod's `Allocated` list
* calling the `Allocate` function with the list of devices

The scheduler is still in charge of filtering the nodes which cannot
satisfy the resource requests.
It might in the future be in charge of selecting the device.

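As a rough illustration of the device assignment step that Kubelet performs before calling `Allocate`, the following Go sketch picks the first healthy devices of a requested kind. The `pickDevices` helper, the trimmed `Device` struct, and the `"Healthy"` string are assumptions made for this example; the proposal does not prescribe a selection policy or the health values.

```go
package main

import (
	"errors"
	"fmt"
)

// Device mirrors a subset of the Device message proposed in this document.
type Device struct {
	Kind   string
	Name   string
	Health string // "Healthy" is an assumed value, not fixed by the proposal
}

// pickDevices returns the first n healthy devices of the given kind, plus
// the devices that remain available afterwards (hypothetical helper).
func pickDevices(available []Device, kind string, n int) (picked, rest []Device, err error) {
	for _, d := range available {
		if len(picked) < n && d.Kind == kind && d.Health == "Healthy" {
			picked = append(picked, d)
			continue
		}
		rest = append(rest, d)
	}
	if len(picked) < n {
		// Leave the available list untouched when the request cannot be met.
		return nil, available, errors.New("not enough healthy devices")
	}
	return picked, rest, nil
}

func main() {
	available := []Device{
		{Kind: "NVIDIA-gpu", Name: "GPU-0", Health: "Healthy"},
		{Kind: "NVIDIA-gpu", Name: "GPU-1", Health: "Unhealthy"},
		{Kind: "NVIDIA-gpu", Name: "GPU-2", Health: "Healthy"},
	}
	picked, rest, err := pickDevices(available, "NVIDIA-gpu", 2)
	fmt.Println(picked, rest, err)
}
```

The picked devices would be the ones passed to `Allocate`, and the remaining ones would form the node's updated `Available` list.
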
## Device Plugin

### Introduction

The device plugin is structured in 5 parts:
1. Registration: The device plugin advertises its presence to Kubelet
2. Discovery: Kubelet calls the device plugin to list its devices
3. Allocate / Deallocate: When creating/deleting containers requesting the
   devices advertised by the device plugin, Kubelet calls the device plugin's
   `Allocate` and `Deallocate` functions.
4. Cleanup: Kubelet terminates the communication through a "Stop"
5. Heartbeat: The device plugin polls Kubelet to know if it's still alive
   and if it has to re-issue a Register request

### Registration

When starting, the device plugin is expected to make a (client) gRPC call
to the `Register` function that Kubelet exposes.

The communication between Kubelet and the device plugin is expected to
happen only through Unix sockets and follow this simple pattern:
1. The device plugin starts its gRPC server
2. The device plugin sends a `RegisterRequest` to Kubelet (through a
   gRPC request)
3. Kubelet starts its Discovery phase and calls `Discover` and `Monitor`
4. Kubelet answers the `RegisterRequest` with a `RegisterResponse`
   containing any error Kubelet might have encountered

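The ordering of the registration pattern (validate, discover, then answer) can be sketched in Go without any gRPC dependency. Everything here is illustrative: the `registry` type, the `discover` callback, and the exact-match version check are assumptions, and the `Error` message is flattened to a string for brevity.

```go
package main

import "fmt"

// RegisterRequest and RegisterResponse mirror the messages of this
// proposal; the Error message is flattened to a string in this sketch.
type RegisterRequest struct {
	Version    string
	Unixsocket string
	Vendor     string
}

type RegisterResponse struct {
	Version string // minimum version supported by Kubelet
	Error   string // empty when registration succeeded
}

// registry is a hypothetical in-memory stand-in for Kubelet's plugin table.
type registry struct {
	minVersion string
	plugins    map[string]string // vendor -> unix socket name
}

// Register validates the request, runs discovery against the plugin and
// only then answers, so discovery errors end up in the response.
func (r *registry) Register(req RegisterRequest, discover func(socket string) error) RegisterResponse {
	resp := RegisterResponse{Version: r.minVersion}
	if req.Version != r.minVersion { // assumed: exact match only
		resp.Error = "API version not supported"
		return resp
	}
	if _, ok := r.plugins[req.Vendor]; ok {
		resp.Error = "a device plugin was already registered for this vendor"
		return resp
	}
	if err := discover(req.Unixsocket); err != nil {
		resp.Error = err.Error()
		return resp
	}
	r.plugins[req.Vendor] = req.Unixsocket
	return resp
}

func main() {
	r := &registry{minVersion: "v1alpha1", plugins: map[string]string{}}
	discoverOK := func(string) error { return nil }
	req := RegisterRequest{Version: "v1alpha1", Unixsocket: "vendor.sock", Vendor: "vendor"}
	fmt.Printf("%+v\n", r.Register(req, discoverOK))
	fmt.Printf("%+v\n", r.Register(req, discoverOK)) // duplicate vendor
}
```
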
### Unix Socket

Device Plugins are expected to communicate with Kubelet through gRPC
on a Unix socket.
When starting the gRPC server, they are expected to create a Unix socket
in the following host directory: `/var/run/Kubernetes`.

For device plugins that run in a container rather than on bare metal, this
means they will have to mount the folder
as a volume in their pod spec ([see Installation](#installation)).

Device plugins can expect to find the socket to register themselves on
the host at the following path:
`/var/run/Kubernetes/kubelet.sock`.

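A minimal sketch of how a plugin might create its socket, using only the Go standard library. The `listenOnSocket` helper and the `vendor.sock` name are assumptions for illustration, and a temporary directory stands in for `/var/run/Kubernetes` so the example runs without special privileges. A real plugin would then hand the listener to its gRPC server.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// listenOnSocket creates a Unix socket named name inside dir, removing any
// stale socket file left behind by a previous run (hypothetical helper).
func listenOnSocket(dir, name string) (net.Listener, error) {
	path := filepath.Join(dir, name)
	if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	return net.Listen("unix", path)
}

func main() {
	// The proposal uses /var/run/Kubernetes; a temporary directory keeps
	// this sketch runnable as an unprivileged user.
	dir, err := os.MkdirTemp("", "device-plugin")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	l, err := listenOnSocket(dir, "vendor.sock")
	if err != nil {
		panic(err)
	}
	defer l.Close()
	fmt.Println("listening on", l.Addr())
}
```
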
### Protocol Overview

When first registering itself with Kubelet, the device plugin
sends:
* The name of its Unix socket
* [The API version against which it was built](#versioning)
* Its `Vendor` ID or the name of the device plugin

Kubelet answers with the minimum version it supports and whether or
not there was an error. The errors may include (but are not limited to):
* API version not supported
* A device plugin was already registered for this vendor
* A device plugin already registered this device
* Vendor is not consistent across discovered devices

Kubelet will then interact with the plugin through the following functions:
* `Discover`: Lists devices
* `Monitor`: Returns a stream that is written to when a
  device becomes unhealthy
* `Allocate`: Called when creating a container, with a list of devices;
  can request changes to the container config
* `Deallocate`: Called when deleting a container; can be used for cleanup

The device plugin is also expected to periodically call the `Heartbeat` function
exposed by Kubelet and issue a `Register` request when it either can't reach
Kubelet or Kubelet answers with a `KO` response.

![Process](./device-plugin.png)

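The heartbeat rule can be sketched as a single step that a plugin would run periodically. The `kubeletClient` interface, the `fakeKubelet` type, and the `"OK"` response are assumptions made for this example; only the `KO` response and the RPC names come from the proposal.

```go
package main

import "fmt"

// kubeletClient is a hypothetical client-side view of the PluginRegistration
// service; the method names follow the RPC names in this proposal.
type kubeletClient interface {
	Heartbeat(vendor string) (response string, err error)
	Register(vendor string) error
}

// heartbeatOnce sends one heartbeat and re-registers when Kubelet is
// unreachable or answers "KO". It reports whether it re-registered.
func heartbeatOnce(c kubeletClient, vendor string) (bool, error) {
	resp, err := c.Heartbeat(vendor)
	if err == nil && resp != "KO" {
		return false, nil
	}
	return true, c.Register(vendor)
}

// fakeKubelet answers "KO" until the plugin registers again, which mimics
// a Kubelet that restarted and lost its plugin table.
type fakeKubelet struct {
	restarted  bool
	registered int
}

func (f *fakeKubelet) Heartbeat(string) (string, error) {
	if f.restarted {
		return "KO", nil
	}
	return "OK", nil // "OK" is an assumed positive answer
}

func (f *fakeKubelet) Register(string) error {
	f.restarted = false
	f.registered++
	return nil
}

func main() {
	k := &fakeKubelet{restarted: true}
	for i := 0; i < 2; i++ {
		reRegistered, err := heartbeatOnce(k, "vendor")
		fmt.Println("re-registered:", reRegistered, "error:", err)
	}
}
```
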
### Protobuf specification

```protobuf
service PluginRegistration {
	rpc Register(RegisterRequest) returns (RegisterResponse) {}
	rpc Heartbeat(HeartbeatRequest) returns (HeartbeatResponse) {}
}

service DevicePlugin {
	rpc Discover(Empty) returns (stream Device) {}
	rpc Monitor(Empty) returns (stream DeviceHealth) {}

	rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
	rpc Deallocate(DeallocateRequest) returns (Empty) {}
}

message RegisterRequest {
	// Version of the API the Device Plugin was built against
	string version = 1;
	// Name of the unix socket the device plugin is listening on
	string unixsocket = 2;
	// Name of the devices the device plugin wants to register
	// A device plugin can only register one kind of devices
	string vendor = 3;
}

message RegisterResponse {
	// Minimum version the Kubelet API supports.
	string version = 1;
	// Kubelet fills this field if it encounters any errors
	// during the registration process or discover process
	Error error = 2;
}

message HeartbeatRequest {
	string vendor = 1;
}

message HeartbeatResponse {
	// Kubelet answers with a string telling the device
	// plugin to either re-register itself or not
	string response = 1;
	// Kubelet fills this field if it encountered any errors
	Error error = 2;
}

message AllocateRequest {
	repeated Device devices = 1;
}

message AllocateResponse {
	// List of environment variable to set in the container.
	repeated KeyValue envs = 1;
	// Mounts for the container.
	repeated Mount mounts = 2;
}

message DeallocateRequest {
	repeated Device devices = 1;
}

message Error {
	bool error = 1;
	string reason = 2;
}

// E.g:
// struct Device {
//    Kind: "NVIDIA-gpu"
//    Name: "GPU-fef8089b-4820-abfc-e83e-94318197576e"
//    Properties: {
//        "Family": "Pascal",
//        "Memory": "4G",
//        "ECC"   : "True",
//    }
// }
//
message Device {
	string Kind = 1;
	string Name = 2;
	string Health = 3;
	string Vendor = 4;
	map<string, string> properties = 5; // Could be [1, 1.2, 1G]
}

message DeviceHealth {
	string Name = 1;
	string Kind = 2;
	string Health = 3;
	string Vendor = 4;
}
```

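To give a feel for how Kubelet might consume the `Monitor` stream, here is a small Go sketch in which a channel stands in for the gRPC server stream. The `trackHealth` function, the map of health states, and the `"Healthy"`/`"Unhealthy"` strings are assumptions for this example; the proposal leaves the health values unspecified.

```go
package main

import "fmt"

// DeviceHealth mirrors the DeviceHealth message above.
type DeviceHealth struct {
	Name   string
	Kind   string
	Health string
	Vendor string
}

// trackHealth drains a stream of health updates (modelled as a channel)
// and records the latest reported health for each device name.
func trackHealth(updates <-chan DeviceHealth, health map[string]string) {
	for u := range updates {
		health[u.Name] = u.Health
	}
}

func main() {
	updates := make(chan DeviceHealth, 1)
	updates <- DeviceHealth{Name: "GPU-1", Kind: "NVIDIA-gpu", Health: "Unhealthy", Vendor: "NVIDIA"}
	close(updates)

	health := map[string]string{"GPU-0": "Healthy", "GPU-1": "Healthy"}
	trackHealth(updates, health)
	fmt.Println(health)
}
```
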
## Installation

The installation process should be straightforward for the user, transparent,
and similar to other regular Kubernetes actions.
Device plugins should also run in containers so that Kubernetes can
deploy them and restart them when they fail.
However, we should not prevent the user from deploying a bare metal device
plugin.

Deploying the device plugins through DaemonSets makes sense: the cluster
admin can specify which machines the device plugins should
run on, the process is similar to any Kubernetes action, and it does not require
changing any part of Kubernetes.

Additionally, for integrated solutions such as `kubeadm`, we can add support
to auto-deploy community-vetted Device Plugins,
thus avoiding further fragmentation of the Kubernetes ecosystem.

Users installing Kubernetes without an integrated solution such
as `kubeadm` would use the examples that we would provide at:
`https://github.com/Kubernetes/Kubernetes/tree/master/examples/device-plugin.yaml`

YAML example:

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  # The original example left metadata empty; this name is an assumption.
  name: device-plugin
spec:
  template:
    metadata:
      labels:
        name: device-plugin
    spec:
      containers:
        - name: device-plugin-ctr
          image: NVIDIA/device-plugin:1.0
          volumeMounts:
            # Mount the shared socket directory inside the container.
            - mountPath: /device-plugin
              name: device-plugin
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/run/Kubernetes
```

## API Changes

### Device

When discovering the devices, Kubelet will be in charge of advertising those
resources to the API server.

We will advertise each device returned by the Device Plugin in a new structure
called `Device`.
It is defined as follows:

```golang
type Device struct {
	Kind       string
	Vendor     string
	Name       string
	Health     DeviceHealthStatus
	Properties map[string]string
}
```

Because the current API (Capacity) cannot be extended to support Device,
we will need to create two new attributes in the NodeStatus structure:
* `DevCapacity`: Describes the device capacity of the node
* `DevAvailable`: Describes the available devices

```golang
type NodeStatus struct {
	DevCapacity  []Device
	DevAvailable []Device
}
```

We also introduce an `Allocated` list in the pod's status, carried by the
`Devices` field of each `ContainerStatus`, so that users
can know which devices were assigned to the pod. It could also be useful for
monitoring.

```golang
type ContainerStatus struct {
	Devices []Device
}
```

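One possible relationship between these fields is that `DevAvailable` equals `DevCapacity` minus the devices listed in container statuses. The Go sketch below illustrates that reading; the `refreshAvailable` helper and matching devices by `Name` alone are assumptions, and the structs are trimmed to the fields the example needs.

```go
package main

import "fmt"

// Device is trimmed to the identifying fields of the proposed structure.
type Device struct {
	Kind   string
	Vendor string
	Name   string
}

// NodeStatus carries only the two proposed device fields.
type NodeStatus struct {
	DevCapacity  []Device
	DevAvailable []Device
}

// ContainerStatus carries only the proposed Devices field.
type ContainerStatus struct {
	Devices []Device
}

// refreshAvailable recomputes DevAvailable as DevCapacity minus every device
// assigned to a container (hypothetical helper, matching on Name only).
func refreshAvailable(node *NodeStatus, containers []ContainerStatus) {
	used := make(map[string]bool)
	for _, c := range containers {
		for _, d := range c.Devices {
			used[d.Name] = true
		}
	}
	var available []Device
	for _, d := range node.DevCapacity {
		if !used[d.Name] {
			available = append(available, d)
		}
	}
	node.DevAvailable = available
}

func main() {
	gpu0 := Device{Kind: "NVIDIA-gpu", Vendor: "NVIDIA", Name: "GPU-0"}
	gpu1 := Device{Kind: "NVIDIA-gpu", Vendor: "NVIDIA", Name: "GPU-1"}
	node := &NodeStatus{DevCapacity: []Device{gpu0, gpu1}}
	refreshAvailable(node, []ContainerStatus{{Devices: []Device{gpu0}}})
	fmt.Println(node.DevAvailable)
}
```
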
## Versioning

Currently there is only one part of Kubernetes (CRI) which is based on
a protobuf model.

The model used by CRI as of now involves the client (Kubelet) checking
whether the server (runtime) version is compatible and then continuing to
communicate with the server.
Currently for CRI, compatible means matching the exact version.
This means that every time the CRI spec changes, the CRI clients need to
be updated.

CRI also uses gRPC-go, which requires the same package name between client
and server.
If they are not the same, no API call can succeed, because the generated gRPC
code registers a service using the `package_name.service_name` convention;
e.g., the StopPodSandbox method is known as `/v1alpha1.RuntimeService/StopPodSandbox`.

To work around this restriction, CRI adopted the strategy of freezing the
package name at `pkg/kubelet/apis/cri/v1alpha1/runtime`.

Considering these restrictions, it seems reasonable to follow the same pattern for
the device plugin proposal to prevent breaking the API:
* Follow protobuf guidelines on versioning:
    * Do not change ordering
    * Do not remove fields or change types
    * Add optional fields
* Freeze the package name to `apis/device-plugin/v1alpha1`
* Have Kubelet and the Device Plugin negotiate versions if we do break the API

Negotiation would take place during registration:
1. When registering itself with Kubelet, the Device Plugin sends the version
   against which it was built.
2. Kubelet returns the minimum version it supports and whether the version sent
   is supported.
3. If Kubelet supports the version sent by the Device Plugin, it
   contacts the Device Plugin.
4. If the Device Plugin supports the version sent by Kubelet, it can and should
   answer the different calls made by Kubelet.

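The Kubelet side of this negotiation could look roughly like the following Go sketch. The `supported` list, the `negotiate` function, and the `v1alpha2` version are hypothetical; the proposal only states that Kubelet returns its minimum supported version and whether the plugin's version is supported.

```go
package main

import "fmt"

// supported lists the API versions a hypothetical Kubelet accepts, oldest
// first. "v1alpha2" is made up for illustration.
var supported = []string{"v1alpha1", "v1alpha2"}

// negotiate returns the minimum version Kubelet supports and whether the
// version sent by the device plugin is one Kubelet can talk to.
func negotiate(pluginVersion string) (minVersion string, ok bool) {
	for _, v := range supported {
		if v == pluginVersion {
			return supported[0], true
		}
	}
	return supported[0], false
}

func main() {
	for _, v := range []string{"v1alpha2", "v0"} {
		minV, ok := negotiate(v)
		fmt.Printf("plugin=%s min=%s supported=%t\n", v, minV, ok)
	}
}
```

A plugin receiving `supported=false` could compare the returned minimum version with the versions it implements before deciding whether to retry.
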
Binary file not shown.
|
After Width: | Height: | Size: 60 KiB |