diff --git a/contributors/design-proposals/device-plugin-2.png b/contributors/design-proposals/device-plugin-2.png
new file mode 100644
index 000000000..60892429f
Binary files /dev/null and b/contributors/design-proposals/device-plugin-2.png differ
diff --git a/contributors/design-proposals/device-plugin.md b/contributors/design-proposals/device-plugin.md
index 339070d89..2f73c9324 100644
--- a/contributors/design-proposals/device-plugin.md
+++ b/contributors/design-proposals/device-plugin.md
@@ -1,99 +1,77 @@
-# Device Manager Proposal
+Device Manager Proposal
+===============

- 1. [Abstract](#abstract)
- 2. [Motivation](#motivation)
- 3. [Use Cases](#use-cases)
- 4. [Objectives](#objectives)
- 5. [Non Objectives](#non-objectives)
- 6. [Stories](#stories)
-    * [Vendor story](#vendor-story)
-    * [User story](#user-story)
- 8. [Device Plugin](#device-plugin)
-    * [Protocol Overview](#protocol-overview)
-    * [Protobuf specification](#protobuf-specification)
-    * [Installation](#installation)
-    * [API Changes](#api-changes)
-    * [Versioning](#versioning)
+
+
+- [Motivation](#motivation)
+- [Use Cases](#use-cases)
+- [Objectives](#objectives)
+- [Non Objectives](#non-objectives)
+- [Proposed Implementation 1](#proposed-implementation-1)
+  - [Vendor story](#vendor-story)
+  - [End User story](#end-user-story)
+  - [Device Plugin](#device-plugin)
+    - [Introduction](#introduction)
+    - [Registration](#registration)
+    - [Unix Socket](#unix-socket)
+    - [Protocol Overview](#protocol-overview)
+    - [Protobuf specification](#protobuf-specification)
+- [Proposed Implementation 2](#proposed-implementation-2)
+  - [Device Plugin Lifecycle](#device-plugin-lifecycle)
+  - [Protobuf API](#protobuf-api)
+  - [Failure recovery](#failure-recovery)
+  - [Roadmap](#roadmap)
+  - [Open Questions](#open-questions-1)
+- [Installation](#installation)
+- [Versioning](#versioning)
+  - [References](#references)
+
+

 _Authors:_
 * @RenaudWasTaken - Renaud Gaubert <rgaubert@NVIDIA.com>

-## Abstract
-
-This document describes a vendor independant solution to:
-  * Discovering and representing external devices
-  * Making these devices available to the container and cleaning them up
-    afterwards
-  * Health Check of these devices
-
-Because devices are vendor dependant and have their own sets of problems
-and mechanisms, the solution we describe is a plugin mechanism managed by
-Kubelet.
-
-At their core, device plugins are simple gRPC servers that may run in a
-container deployed through the pod mechanism.
-
-These servers implement the gRPC interface defined later in this design
-document and once the device plugin makes itself know to kubelet, kubelet
-will interact with the device through three simple functions:
-  1. A `Discover` function for the kubelet to Discover the devices and
-     their properties.
-  2. An `Allocate` and `Deallocate` function which are called respectively
-     before container creation and after container deletion with the
-     devices to allocate and deallocate.
-  3. A `Monitor` function to notify Kubelet whenever a device becomes
-     unhealthy.
-
-The goal is for a user to be able to enable vendor devices (e.g: GPUs) through
-the simple following steps:
-  * `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
-  * When launching `kubectl describe nodes`, the devices appear in the node spec
-  * In the long term users will be able to select them through Resource Class
-
-We expect the plugins to be deployed across the clusters through DaemonSets.
-The targeted devices are GPUs, NICs, FPGAs, InfiniBand, Storage devices, ....
-
-
-## Motivation
+# Motivation

 Kubernetes currently supports discovery of CPU and Memory primarily to a
 minimal extent. Very few devices are handled natively by Kubelet.

 It is not a sustainable solution to expect every vendor to add their vendor
-specific code inside Kubernetes. This approach does not scale and is not
-portable.
+specific code inside Kubernetes to make their devices usable.
+Instead, we want a solution for vendors to be able to advertise their resources
+to Kubelet and monitor them without writing custom Kubernetes code.
+We also want to provide a consistent and portable solution for users to
+consume hardware devices across k8s clusters.

-We want a solution for those vendors to be able to advertise their resources
-to kubelet and monitor them.
-We also want a way for the user to specify which resource their jobs will use
-and what constraints are associated to these resources.
+This document describes a vendor-independent solution to:
+  * Discovering and representing external devices
+  * Making these devices available to the containers using these devices and
+    cleaning them up afterwards
+  * Monitoring these devices

-In order to solve this problem it is obvious that we need a plugin system in
-order to have vendors advertise and monitor their resources on behalf
-of Kubelet.
+Because devices are vendor-dependent and have their own sets of problems
+and mechanisms, the solution we describe is a plugin mechanism that may run
+in a container deployed through the DaemonSet mechanism.
+The targeted devices include GPUs, high-performance NICs, FPGAs, InfiniBand,
+storage devices, and other similar computing resources that require vendor
+specific initialization and setup.

-Additionally, we introduce the concept of Device to be able to select
-resources with constraints in a pod spec.
+The goal is for a user to be able to enable vendor devices (e.g., GPUs) through
+the following simple steps:
+  * `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
+  * When launching `kubectl describe nodes`, the devices appear in the node spec
+  * In the long term users will be able to select them through Resource Class

-_GPU Integration Example:_
-  * [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
-  * [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
+# Use Cases

-_Kubernetes Meeting Notes On This:_
-  * [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
-  * [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
-  * [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
+  * I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
+    in my pod.
+  * I should be able to use that device without writing custom Kubernetes code.
+  * I want a consistent and portable solution to consume hardware devices
+    across k8s clusters.

-## Use Cases
-
- * I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
-   in my pod.
- * I should be able to use that device without writing custom Kubernetes code.
- * I want a consistent and portable solution to consume hardware devices
-   across k8s clusters.
-
-## Objectives
+# Objectives

 1. Add support for vendor specific Devices in kubelet:
     * Through a pluggable mechanism.
@@ -103,16 +81,18 @@ _Kubernetes Meeting Notes On This:_
 2. Define a deployment mechanism for this new API.
 3. Define a versioning mechanism for this new API.

-## Non Objectives
-1. Advanced scheduling and resource selection (solved through [#782](https://github.com/Kubernetes/community/pull/782)).
+# Non Objectives
+
+1. Advanced scheduling and resource selection (solved through
+   [#782](https://github.com/Kubernetes/community/pull/782)).
    We will only try to give basic selection primitives to the devices
 2. Metrics: this should be the job of cadvisor and should probably either be
    addressed there (cadvisor) or if people feel there is a case to be made for it
    being addressed in the Device Plugin, in a follow up proposal.

-## Stories
+# Proposed Implementation 1

-### Vendor story
+## Vendor story

 Kubernetes provides to vendors a mechanism called device plugins to:
   * advertise devices.
@@ -144,7 +124,7 @@ onwn gRPC server.
 Only then will kubelet start interacting with the vendor's device plugin
 through the gRPC apis.

-### End User story
+## End User story

 When setting up the cluster the admin knows what kind of devices are present
 on the different machines and therefore can select what devices they want to
@@ -182,6 +162,7 @@ He might in the future be in charge of selecting the device.
 ## Device Plugin

 ### Introduction
+
 The device plugin is structured in 5 parts:
 1. Registration: The device plugin advertises it's presence to Kubelet
 2. Discovery: Kubelet calls the device plugin to list it's devices
@@ -189,7 +170,7 @@ The device plugin is structured in 5 parts:
    devices advertised by the device plugin, Kubelet calls the device plugin's
    `Allocate` and `Deallocate` functions.
 4. Cleanup: Kubelet terminates the communication through a "Stop"
-4. Heartbeat: The device plugin polls Kubelet to know if it's still alive
+5. Heartbeat: The device plugin polls Kubelet to know if it's still alive
    and if it has to re-issue a Register request

 ### Registration
@@ -247,7 +228,7 @@ The device plugin is also expected to periodically call the `Heartbeat` function
 exposed by Kubelet and issue a `Registration` request when it either can't reach
 Kubelet or Kubelet answers with a `KO` response.

-![Process](./device-plugin.png)
+![Process](device-plugin.png)

 ### Protobuf specification

@@ -343,7 +324,233 @@ message DeviceHealth {
 }
 ```

-## Installation
+# Proposed Implementation 2
+
+The main strategy of this proposed implementation is that we want to start with
+something simple that can show benefits on our immediate use cases, yet the
+API design should be extendable to support future requirements.
+Here are the main motivations for this alternative proposed implementation:
+
+* Discovery phase: can we eliminate this gRPC procedure? It seems more
+  natural for device plugin to send Kubelet the discovered device information
+  right after device initialization and the registration gRPC procedure.
+* The current implementation uses gRPC to communicate between Kubelet and
+  device plugin. Both Kubelet and device plugin need to start a gRPC server
+  for two-way communication. This seems a bit complicated. Can we have
+  device plugin send enough information to Kubelet so that we only need
+  Kubelet to start gRPC server and device plugin is kept as gRPC client?
+  The main concern with one-way gRPC communication is that we can NOT
+  support device specific operations, like reset device, during
+  allocation/deallocation. Depending on how long we expect device specific
+  operations to take, we can support this feature later by
+  having device plugin also provide a gRPC service or Device Plugin
+  can instruct Kubelet to perform device specific operation hooks.
+* Do we need checkpointing in the initial prototype implementation?
+  Even in an alpha release, we may still want to be able to recover from
+  various failure scenarios. Otherwise, it would affect user experience.
+  Currently, it seems the only information we need to record somewhere
+  is what device is allocated to what pod/container. There have been
+  discussions on different ways and places to record this information.
+  The approach taken by the current implementation pushes this information
+  to ApiServer by extending NodeStatus interface between Kubelet and ApiServer.
+  The major concern on this approach is that it introduces an API extension
+  apart from the current model (currently Node information recorded at ApiServer
+  only contains resource capacity information. Resource allocation information
+  is kept at Node). The second approach is for Kubelet to checkpoint this
+  information. This seems to align with the current Kubernetes model that
+  Kubelet is the component to implement allocation/deallocation functionalities.
+  The information we want to checkpoint, i.e., what device is allocated to what
+  pod/container, also seems generic enough to be implemented at Kubelet.
+  It may also allow other use cases outside device plugin, e.g., CPU pinning.
+  The third approach is to implement this in device plugin. This way,
+  device plugin can also record any state information useful to its own
+  failure recovery in checkpoints. One concern on this approach is that it
+  may add more burden on vendors to implement their device plugin images.
+  Surprises might happen in production if things were not implemented correctly.
+  It also seems apart from the current model as today Kubelet is the place
+  where allocation/deallocation happens for other types of resources.
+* Heartbeat: do we need this to make sure connections can be re-established
+  between kubelet and device plugin? Can we reuse keepalive feature from gRPC?
+  Or if Kubelet checkpoints device allocation state information, device plugin
+  may only need to detect Kubelet failure when it needs to update device
+  information. Or can device plugin send periodic device state updates
+  (this may be needed anyway if we want to collect device usage stats)
+  and use that to detect Kubelet failure or device plugin failure?
+
+## Device Plugin Lifecycle
+
+![Process](device-plugin-2.png)
+
+1. User or cluster admin pushes vendor-specific device plugin DaemonSets.
+   The DaemonSet YAML config includes mountPaths to the host directories
+   where device driver, user-space libraries, and tools will be installed.
+2. After device plugin container is brought up, it detects the specific
+   types of HW devices. If such devices exist, it initializes these devices
+   and sets up the environments to access these devices (e.g., install
+   device drivers, user-space libraries, and tools).
+3. After initialization, device plugin queries HW device states through the
+   installed device monitoring tools or other device interfaces. Then device
+   plugin connects to the Kubelet device plugin gRPC server and sends it the
+   obtained list of HW device information. In the initial prototype, the
+   device resource exported by a device plugin can be implemented as an
+   [extended OIR](https://github.com/kubernetes/kubernetes/pull/48922)
+   with special prefix “extensions.kubernetes.io/”, plus device resource_name
+   that uniquely identifies a device plugin on a node.
+   Kubelet can use existing API to add this resource to API server so that the
+   device resource is available for scheduling.
+4. Device plugin runs in a loop to continuously query HW device states. If it
+   detects any changes, it sends the Kubelet device plugin gRPC server the new
+   list of HW device information. Kubelet can use this information to update its
+   device capacity states and if necessary, re-allocate new device to a user
+   container with unhealthy allocated devices.
+5. When Kubelet receives an allocation request for a HW device advertised
+   by a device plugin (i.e., resource with “extensions.kubernetes.io/” prefix
+   plus device resource_name), it updates its internal allocation state,
+   issues certain calls to CRI to bind mount the host directories where
+   user-space libraries and tools are installed to the device-specific
+   default directories in user Pod or set up certain environment variables,
+   and checkpoints the container-to-device allocation information to persistent
+   storage.
+6. When user container accessing the device finishes, Kubelet updates its
+   internal state to deallocate the device, and updates its checkpoint state
+   in persistent storage.
+7. When device plugin DaemonSet is removed, clean up device state (e.g., uninstall
+   device driver, remove user-space libraries and tools). This step can be
+   specified as a preStop
+   [container lifecycle step](https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/).
+   Note one implication from this approach is that device plugin upgrade process
+   will be disruptive. It will need more thought if we want to
+   support non-disruptive upgrade process.
+
+## Protobuf API
+
+```go
+
+service PluginResource {
+  rpc Register(RegisterRequest) returns (RegisterResponse) {}
+  rpc ReportDeviceInfo(ReportRequest) returns (ReportResponse) {}
+}
+
+message RegisterRequest {
+  // Version of the API the Device Plugin was built against
+  string version = 1;
+  // E.g., "nvidia-gpu". Used to construct OIR:
+  // “extensions.kubernetes.io/resourcename”.
+  string resourcename = 2;
+}
+
+message RegisterResponse {
+  bool success = 1;
+  // Kubelet fills this field with details if it encounters any errors
+  // during the registration process, e.g., for version mismatch, what
+  // is the required version and minimum supported version by kubelet.
+  string error = 2;
+}
+
+message ReportRequest {
+  repeated DeviceInfo devices = 1;
+}
+
+message DeviceInfo {
+  // E.g., "GPU-fef8089b-4820-abfc-e83e-94318197576e".
+  // Needs to be unique per device plugin.
+  string Id = 1;
+  // Health state; State is an enum such as UNKNOWN, HEALTHY, UNHEALTHY.
+  State state = 2;
+  // E.g., {"/rootfs/nvidia":"/usr/local/nvidia"}
+  // Maps from host directory where device library or tools
+  // are installed to user pod directory where the library or
+  // tools are expected to be accessed. Kubelet will use this
+  // information to bind mount host directory to the user pod
+  // directory during allocation.
+  map<string, string> mountpaths = 3;
+  // E.g., {"LD_PRELOAD":"xxx.so"}. Kubelet will export these
+  // env variables in user pod during allocation.
+  map<string, string> envariables = 4;
+  // E.g., {"Family":"Pascal"} {"ECC":"True"}
+  // These fields can be used as node labels for selection
+  map<string, string> labels = 5;
+}
+
+message ReportResponse {
+  bool success = 1;
+  // Kubelet fills this field if it encounters any errors
+  // during the report process, e.g., device plugin hasn’t
+  // registered yet (could happen when Kubelet restarts).
+  string error = 2;
+}
+```
+
+## Failure recovery
+
+* Device failure: Device plugin should be able to detect device failure and
+  report that to Kubelet. Kubelet should then remove the failed device from
+  available list. If there is any user container using the failed device,
+  Kubelet may terminate the user container and reschedule it on a good
+  available device. When a failed device recovers, device plugin will send
+  Kubelet the updated device state and Kubelet can add the device to the
+  available device list.
+* Kubelet crash: When Kubelet restarts after a crash, it should be able to
+  recover allocation states from the checkpoints recorded on persistent storage.
+  The checkpoint records should include allocated device id to pod mapping
+  information as well as non-allocated device information, so Kubelet can
+  re-establish precise allocation state. Device plugin should be able to
+  detect Kubelet failure when it needs to update device information,
+  and re-register with the new Kubelet.
+* Device plugin crash: A device plugin is deployed through DaemonSets.
+  If a device plugin process fails, Kubelet will detect that and automatically
+  restart it. After restart, device plugin will reconnect to Kubelet and
+  report the current device states. Kubelet can compare the reported device
+  information with its internal device states, and make adjustments if
+  necessary. One thing we need to pay special attention to is that device plugin
+  may fail at any time, e.g., during initialization. When the new device plugin
+  process starts, it needs to be able to recover from incomplete states.
+
+## Roadmap
+
+* Phase 1: device plugin is supported in alpha mode in 1.8 kubernetes release.
+  Make sure it provides the following functionalities: initialize, discover,
+  allocate/deallocate, cleanup, basic health check, and can recover from device,
+  Kubelet or device plugin failures. Make sure the interface is kept simple and
+  extensible, and the document is clear. With the initial implemented API,
+  make sure we can use the interface to implement device plugin images for at
+  least two types of devices: Nvidia GPU and Solarflare NIC. Note the support
+  for Nvidia GPU will help GPU support to enter beta by providing a general
+  and extendable API. Test coverage: e2e tests with the developed Nvidia GPU
+  image and Solarflare image to make sure these devices can be correctly
+  initialized, allocated, deallocated, and cleaned up. We should also test that
+  we can recover from device failure, Kubelet restarts, and device plugin failure.
+* Phase 2: device plugin is supported in beta mode in 1.9 kubernetes release.
+  At this phase, the primary design and API should be stabilized. We need to
+  implement an authentication mechanism to ensure only trusted device plugin
+  images can be registered. We can support device specific
+  allocation/deallocation requests by having device plugin also provide a gRPC
+  service or Device Plugin can instruct Kubelet to perform device specific
+  operation hooks during allocation/deallocation procedures.
+  Hopefully at this time, we may make good progress on supporting more flexible
+  resource allocation policies in Kubernetes, and with that, we can switch
+  device plugin from using OIR to using ResourceClass to allow more efficient
+  HW specific resource allocations, e.g., topology aware resource allocations,
+  NUMA aware resource allocations etc.
+* Phase 3: device plugin is supported in GA mode in 1.10 kubernetes release.
+  Device plugin should have clear error handling and problem reporting that
+  allows easy debuggability and monitoring on its exported devices.
+  We should have clear documentation on how to develop a device plugin and
+  how to interact with device plugin. The framework needs to be stable and
+  demonstrate good user experiences through the support on multiple types
+  of devices, such as GPU, InfiniBand, high-performance NIC, etc.
+
+## Open Questions
+
+* The proposal assumes we can omit device specific allocation/deallocation
+operations in the alpha release and support this feature in later releases.
+If people are concerned that such omission would impact the usability of
+alpha release, we will need to come up with a solution that would either
+require two-way gRPC communication between Kubelet and Device Plugin or
+Device Plugin can instruct Kubelet to perform device specific operation hooks
+during allocation/deallocation procedures.
+
+# Installation

 The installation process should be straightforward to the user, transparent
 and similar to other regular Kubernetes actions.
@@ -366,6 +573,7 @@ as `kubeadm` they would use the examples that we would provide at:
 `https://github.com/Kubernetes/Kubernetes/tree/master/examples/device-plugin.yaml`

 YAML example:
+
 ```yaml
 apiVersion: extensions/v1beta1
 kind: DaemonSet
@@ -388,48 +596,6 @@ spec:
           path: /var/run/Kubernetes
 ```

-## API Changes
-### Device
-
-When discovering the devices, Kubelet will be in charge of advertising those
-resources to the API server.
-
-We will advertise each device returned by the Device Plugin in a new structure
-called `Device`.
-It is defined as follows:
-
-```golang
-type Device struct {
-	Kind       string
-	Vendor     string
-	Name       string
-	Health     DeviceHealthStatus
-	Properties map[string]string
-}
-```
-
-Because the current API (Capacity) can not be extended to support Device,
-we will need to create two new attributes in the NodeStatus structure:
-  * `DevCapacity`: Describing the device capacity of the node
-  * `DevAvailable`: Describing the available devices
-
-```golang
-type NodeStatus struct {
-	DevCapacity  []Device
-	DevAvailable []Device
-}
-```
-
-We also introduce the `Allocated` field in the pod's status so that user
-can know what devices were assigned to the pod. It could also be useful in
-the case of monitoring
-
-```golang
-type ContainerStatus struct {
-	Devices []Device
-}
-```
-
 # Versioning

 Currently there is only one part (CRI) of Kubernetes which is based on
@@ -469,3 +635,12 @@ Negotiation would take place in the registration:
    contacts the Device Plugin
 4. If the Device Plugin supports the version sent by Kubelet it can and
    should answer the different calls made by Kubelet
+
+## References
+
+  * [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
+  * [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
+  * [Kubernetes Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
+  * [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
+  * [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
+
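
The one-way flow in "Proposed Implementation 2" (the plugin calls `Register`, then `ReportDeviceInfo`, and Kubelet derives capacity from the healthy devices) may be easier to follow as a small in-memory sketch. This is not the proposal's implementation and uses no gRPC. The type and function names, the `v1alpha1` version string, and passing the resource name explicitly to `ReportDeviceInfo` are all assumptions made for illustration; only the `extensions.kubernetes.io/` prefix, the two RPC names, and the health states come from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// Prefix taken from the proposal; the version string is an assumption.
	resourcePrefix   = "extensions.kubernetes.io/"
	supportedVersion = "v1alpha1"
)

// State mirrors the UNKNOWN / HEALTHY / UNHEALTHY states named in DeviceInfo.
type State int

const (
	Unknown State = iota
	Healthy
	Unhealthy
)

// DeviceInfo is a reduced, hypothetical version of the DeviceInfo message.
type DeviceInfo struct {
	ID    string
	State State
}

// ErrNotRegistered models the "device plugin hasn't registered yet" error.
var ErrNotRegistered = errors.New("device plugin has not registered yet")

// Kubelet is an in-memory stand-in for the Kubelet side of the API.
type Kubelet struct {
	devices map[string][]DeviceInfo // keyed by full resource name
}

func NewKubelet() *Kubelet {
	return &Kubelet{devices: make(map[string][]DeviceInfo)}
}

// Register checks the API version and records the plugin's resource name.
func (k *Kubelet) Register(version, resourceName string) (string, error) {
	if version != supportedVersion {
		return "", fmt.Errorf("unsupported version %q, kubelet requires %q",
			version, supportedVersion)
	}
	name := resourcePrefix + resourceName
	if _, ok := k.devices[name]; !ok {
		k.devices[name] = nil
	}
	return name, nil
}

// ReportDeviceInfo replaces the device list of an already registered plugin.
func (k *Kubelet) ReportDeviceInfo(name string, devices []DeviceInfo) error {
	if _, ok := k.devices[name]; !ok {
		return ErrNotRegistered
	}
	k.devices[name] = append([]DeviceInfo(nil), devices...)
	return nil
}

// Capacity counts the devices currently reported as healthy.
func (k *Kubelet) Capacity(name string) int {
	n := 0
	for _, d := range k.devices[name] {
		if d.State == Healthy {
			n++
		}
	}
	return n
}

func main() {
	k := NewKubelet()
	name, err := k.Register(supportedVersion, "nvidia-gpu")
	if err != nil {
		fmt.Println("register failed:", err)
		return
	}
	report := []DeviceInfo{{ID: "GPU-0", State: Healthy}, {ID: "GPU-1", State: Unhealthy}}
	if err := k.ReportDeviceInfo(name, report); err != nil {
		fmt.Println("report failed:", err)
		return
	}
	fmt.Println(name, "capacity:", k.Capacity(name))
}
```

A restarted Kubelet starts with an empty registry in this sketch, so a plugin's next report fails with `ErrNotRegistered`, which is the signal the "Failure recovery" section relies on to trigger re-registration.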