Device plugin proposal patch by Jiaying

2017-07-19 17:53:03 -07:00 · 2017-07-19 17:53:03 -07:00 · 4cbc77e491
parent 9d7a245a23
commit 4cbc77e491
2 changed files with 304 additions and 129 deletions
--- a/contributors/design-proposals/device-plugin-2.png
+++ b/contributors/design-proposals/device-plugin-2.png
--- a/contributors/design-proposals/device-plugin.md
+++ b/contributors/design-proposals/device-plugin.md
@@ -1,99 +1,77 @@
-# Device Manager Proposal
+Device Manager Proposal
 ===============
-  1. [Abstract](#abstract)
+<!-- BEGIN MUNGE: GENERATED_TOC -->
-  2. [Motivation](#motivation)
+
-  3. [Use Cases](#use-cases)
+- [Motivation](#motivation)
-  4. [Objectives](#objectives)
+- [Use Cases](#use-cases)
-  5. [Non Objectives](#non-objectives)
+- [Objectives](#objectives)
-  6. [Stories](#stories)
+- [Non Objectives](#non-objectives)
-      * [Vendor story](#vendor-story)
+- [Proposed Implementation 1](#proposed-implementation-1)
-      * [User story](#user-story)
+  - [Vendor story](#vendor-story)
-  8. [Device Plugin](#device-plugin)
+  - [End User story](#end-user-story)
-      * [Protocol Overview](#protocol-overview)
+  - [Device Plugin](#device-plugin)
-      * [Protobuf specification](#protobuf-specification)
+    - [Introduction](#introduction)
-      * [Installation](#installation)
+    - [Registration](#registration)
-      * [API Changes](#api-changes)
+    - [Unix Socket](#unix-socket)
-      * [Versioning](#versioning)
+    - [Protocol Overview](#protocol-overview)
    - [Protobuf specification](#protobuf-specification)
 - [Proposed Implementation 2](#proposed-implementation-2)
  - [Device Plugin Lifecycle](#device-plugin-lifecycle)
  - [Protobuf API](#protobuf-api)
  - [Failure recovery](#failure-recovery)
  - [Roadmap](#roadmap)
  - [Open Questions](#open-questions-1)
 - [Installation](#installation)
 - [Versioning](#versioning)
  - [References](#references)
 <!-- END MUNGE: GENERATED_TOC -->
 _Authors:_
 * @RenaudWasTaken - Renaud Gaubert &lt;rgaubert@NVIDIA.com&gt;
-## Abstract
+# Motivation
 This document describes a vendor independent solution to:
  * Discovering and representing external devices
  * Making these devices available to the container and cleaning them up
    afterwards
  * Health Check of these devices
 Because devices are vendor dependent and have their own sets of problems
 and mechanisms, the solution we describe is a plugin mechanism managed by
 Kubelet.
 At their core, device plugins are simple gRPC servers that may run in a
 container deployed through the pod mechanism.
 These servers implement the gRPC interface defined later in this design
 document. Once the device plugin makes itself known to kubelet, kubelet
 will interact with the device through three simple functions:
  1. A `Discover` function for the kubelet to discover the devices and
     their properties.
  2. `Allocate` and `Deallocate` functions, which are called respectively
     before container creation and after container deletion with the
     devices to allocate and deallocate.
  3. A `Monitor` function to notify Kubelet whenever a device becomes
     unhealthy.
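
The three interactions listed above can be pictured as a small Go interface. This is only a sketch: the `Device` type, the method signatures, and the in-memory `fakePlugin` are assumptions made for illustration and are not the gRPC API defined later in this proposal.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical, simplified view of an advertised device.
type Device struct {
	Name    string
	Healthy bool
}

// DevicePlugin mirrors the interactions listed above. The signatures are
// assumptions for illustration, not the proposal's gRPC definitions.
type DevicePlugin interface {
	Discover() ([]Device, error)
	Allocate(names []string) error
	Deallocate(names []string) error
	Monitor(unhealthy chan<- Device)
}

// fakePlugin is an in-memory stand-in used only for this example.
type fakePlugin struct {
	devices   []Device
	allocated map[string]bool
}

func newFakePlugin(names ...string) *fakePlugin {
	p := &fakePlugin{allocated: map[string]bool{}}
	for _, n := range names {
		p.devices = append(p.devices, Device{Name: n, Healthy: true})
	}
	return p
}

// find returns a pointer to the named device, or nil if it is unknown.
func (p *fakePlugin) find(name string) *Device {
	for i := range p.devices {
		if p.devices[i].Name == name {
			return &p.devices[i]
		}
	}
	return nil
}

// Discover returns a copy of the device list.
func (p *fakePlugin) Discover() ([]Device, error) {
	return append([]Device(nil), p.devices...), nil
}

// Allocate validates every requested device first, so a failed request
// allocates nothing.
func (p *fakePlugin) Allocate(names []string) error {
	for _, n := range names {
		d := p.find(n)
		if d == nil || !d.Healthy || p.allocated[n] {
			return errors.New("device unavailable: " + n)
		}
	}
	for _, n := range names {
		p.allocated[n] = true
	}
	return nil
}

// Deallocate releases the named devices.
func (p *fakePlugin) Deallocate(names []string) error {
	for _, n := range names {
		delete(p.allocated, n)
	}
	return nil
}

// Monitor sends every currently unhealthy device on the channel once.
func (p *fakePlugin) Monitor(unhealthy chan<- Device) {
	for _, d := range p.devices {
		if !d.Healthy {
			unhealthy <- d
		}
	}
}

func main() {
	var plugin DevicePlugin = newFakePlugin("gpu0", "gpu1")
	devs, _ := plugin.Discover()
	fmt.Println("discovered:", len(devs))
	fmt.Println("allocate gpu0:", plugin.Allocate([]string{"gpu0"}))
	fmt.Println("allocate gpu0 again:", plugin.Allocate([]string{"gpu0"}))
}
```

A real plugin would expose these calls over gRPC and watch device health continuously; the one-shot `Monitor` here only keeps the example short.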
 The goal is for a user to be able to enable vendor devices (e.g., GPUs) through
 the following simple steps:
  * `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
  * When launching `kubectl describe nodes`, the devices appear in the node spec
  * In the long term users will be able to select them through Resource Class
 We expect the plugins to be deployed across the clusters through DaemonSets.
 The targeted devices are GPUs, NICs, FPGAs, InfiniBand, Storage devices, ....
 ## Motivation
 Kubernetes currently supports discovery of CPU and Memory primarily to a
 minimal extent. Very few devices are handled natively by Kubelet.
 It is not a sustainable solution to expect every vendor to add their vendor
-specific code inside Kubernetes. This approach does not scale and is not
+specific code inside Kubernetes to make their devices usable.
-portable.
+Instead, we want a solution for vendors to be able to advertise their resources
 to Kubelet and monitor them without writing custom Kubernetes code.
 We also want to provide a consistent and portable solution for users to
 consume hardware devices across k8s clusters.
-We want a solution for those vendors to be able to advertise their resources
+This document describes a vendor independent solution to:
-to kubelet and monitor them.
+  * Discovering and representing external devices
-We also want a way for the user to specify which resource their jobs will use
+  * Making these devices available to the containers using these devices and
-and what constraints are associated to these resources.
+    cleaning them up afterwards
  * Monitoring these devices
-In order to solve this problem it is obvious that we need a plugin system in
+Because devices are vendor dependent and have their own sets of problems
-order to have vendors advertise and monitor their resources on behalf
+and mechanisms, the solution we describe is a plugin mechanism that may run
-of Kubelet.
+in a container deployed through the DaemonSets mechanism.
 The targeted devices include GPUs, High-performance NICs, FPGAs, InfiniBand,
 Storage devices, and other similar computing resources that require vendor
 specific initialization and setup.
-Additionally, we introduce the concept of Device to be able to select
+The goal is for a user to be able to enable vendor devices (e.g., GPUs) through
-resources with constraints in a pod spec.
+the following simple steps:
  * `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
  * When launching `kubectl describe nodes`, the devices appear in the node spec
  * In the long term users will be able to select them through Resource Class
-_GPU Integration Example:_
+# Use Cases
  * [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
  * [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
-_Kubernetes Meeting Notes On This:_
+ * I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
-  * [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
+   in my pod.
-  * [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
+ * I should be able to use that device without writing custom Kubernetes code.
-  * [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
+ * I want a consistent and portable solution to consume hardware devices
   across k8s clusters.
-## Use Cases
+# Objectives
  * I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
    in my pod.
  * I should be able to use that device without writing custom Kubernetes code.
  * I want a consistent and portable solution to consume hardware devices
    across k8s clusters.
 ## Objectives
 1. Add support for vendor specific Devices in kubelet:
    * Through a pluggable mechanism.
@@ -103,16 +81,18 @@ _Kubernetes Meeting Notes On This:_
 2. Define a deployment mechanism for this new API.
 3. Define a versioning mechanism for this new API.
-## Non Objectives
+# Non Objectives
-1. Advanced scheduling and resource selection (solved through [#782](https://github.com/Kubernetes/community/pull/782)).
+
 1. Advanced scheduling and resource selection (solved through
   [#782](https://github.com/Kubernetes/community/pull/782)).
   We will only try to give basic selection primitives to the devices
 2. Metrics: this should be the job of cadvisor and should probably either be
   addressed there (cadvisor) or if people feel there is a case to be made
   for it being addressed in the Device Plugin, in a follow up proposal.
-## Stories
+# Proposed Implementation 1
-### Vendor story
+## Vendor story
 Kubernetes provides to vendors a mechanism called device plugins to:
  * advertise devices.
@@ -144,7 +124,7 @@ own gRPC server.
 Only then will kubelet start interacting with the vendor's device plugin
 through the gRPC apis.
-### End User story
+## End User story
 When setting up the cluster the admin knows what kind of devices are present
 on the different machines and therefore can select what devices they want to
@@ -182,6 +162,7 @@ He might in the future be in charge of selecting the device.
 ## Device Plugin
 ### Introduction
 The device plugin is structured in 5 parts:
 1. Registration: The device plugin advertises its presence to Kubelet
 2. Discovery: Kubelet calls the device plugin to list its devices
@@ -189,7 +170,7 @@ The device plugin is structured in 5 parts:
   devices advertised by the device plugin, Kubelet calls the device plugin's
   `Allocate` and `Deallocate` functions.
 4. Cleanup: Kubelet terminates the communication through a "Stop"
-4. Heartbeat: The device plugin polls Kubelet to know if it's still alive
+5. Heartbeat: The device plugin polls Kubelet to know if it's still alive
   and if it has to re-issue a Register request
 ### Registration
@@ -247,7 +228,7 @@ The device plugin is also expected to periodically call the `Heartbeat` function
 exposed by Kubelet and issue a `Registration` request when it either can't reach
 Kubelet or Kubelet answers with a `KO` response.
-![Process](./device-plugin.png)
+![Process](device-plugin.png)
 ### Protobuf specification
@@ -343,7 +324,233 @@ message DeviceHealth {
 }
 ```
-## Installation
+# Proposed Implementation 2
 The main strategy of this proposed implementation is that we want to start with
 something simple that can show benefits on our immediate use cases, yet the
 API design should be extendable to support future requirements.
 Here are the main motivations for this alternative proposed implementation:
 * Discovery phase: can we eliminate this gRPC procedure? It seems more
  natural for device plugin to send Kubelet the discovered device information
  right after device initialization and the registration gRPC procedure.
 * The current implementation uses gRPC to communicate between Kubelet and
  device plugin. Both Kubelet and device plugin need to start a gRPC server
  for two-way communication. This seems a bit complicated. Can we have
  device plugin send enough information to Kubelet so that we only need
  Kubelet to start gRPC server and device plugin is kept as gRPC client?
  The main concern with one-way gRPC communication is that we can NOT
  support device specific operations, like reset device, during
  allocation/deallocation. Depending on how long we expect device specific
  operations to take, we can support this feature later by
  having device plugin also provide a gRPC service or Device Plugin
  can instruct Kubelet to perform device specific operation hooks.
 * Do we need checkpointing in the initial prototype implementation?
  Even in alpha release, we may still want to be able to recover from
  various failure scenarios. Otherwise, it would affect user experience.
  Currently, it seems the only information we need to record somewhere
  is what device is allocated to what pod/container. There have been
  discussions on different ways and places to record this information.
  The approach taken by the current implementation pushes this information
  to ApiServer by extending NodeStatus interface between Kubelet and ApiServer.
  The major concern on this approach is that it introduced an API extension
  apart from the current model (Currently Node information recorded at ApiServer
  only contains resource capacity information. Resource allocation information
  is kept at Node). The second approach is for Kubelet to checkpoint this
  information. This seems to align with the current Kubernetes model that
  Kubelet is the component to implement allocation/deallocation functionalities.
  The information we want to checkpoint, i.e., what device is allocated to what
  pod/container, also seems generic enough to be implemented at Kubelet.
  It may also allow other use cases outside device plugin, e.g., CPU pinning.
  The third approach is to implement this in device plugin. This way,
  device plugin can also record any state information useful to its own
  failure recovery in checkpoints. One concern with this approach is that it
  may place more burden on vendors implementing their device plugin images.
  Surprises might happen in production if things were not implemented correctly.
  It also seems apart from the current model as today Kubelet is the place
  where allocation/deallocation happens for other types of resources.
 * Heartbeat: do we need this to make sure connections can be re-established
  between kubelet and device plugin? Can we reuse keepalive feature from gRPC?
  Or if Kubelet checkpoints device allocation state information, device plugin
  may only need to detect Kubelet failure when it needs to update device
  information. Or can device plugin send periodic device state updates
  (this may be needed anyway if we want to collect device usage stats)
  and use that to detect Kubelet failure or device plugin failure?
 ## Device Plugin Lifecycle
 ![Process](device-plugin-2.png)
 1. A user or cluster admin pushes a vendor-specific device plugin DaemonSet.
   The DaemonSet YAML config includes mountPaths to the host directories
   where device driver, user-space libraries, and tools will be installed.
 2. After device plugin container is brought up, it detects the specific
   types of HW devices. If such devices exist, it initializes these devices
   and sets up the environments to access these devices (e.g., install
   device drivers, user-space libraries, and tools).
 3. After initialization, device plugin queries HW device states through the
   installed device monitoring tools or other device interfaces. Then device
   plugin connects to the Kubelet device plugin gRPC server and sends it the
   obtained list of HW device information. In the initial prototype, the
   device resource exported by a device plugin can be implemented as an
   [extended OIR](https://github.com/kubernetes/kubernetes/pull/48922)
   with special prefix “extensions.kubernetes.io/”, plus device resource_name
   that uniquely identifies a device plugin on a node.
   Kubelet can use existing API to add this resource to API server so that the
   device resource is available for scheduling.
 4. Device plugin runs in a loop to continuously query HW device states. If it
   detects any changes, it sends the Kubelet device plugin gRPC server the new
   list of HW device information. Kubelet can use this information to update its
   device capacity states and if necessary, re-allocate new device to a user
   container with unhealthy allocated devices.
 5. When Kubelet receives an allocation request for a HW device advertised
   by a device plugin (i.e., resource with “extensions.kubernetes.io/” prefix
   plus device resource_name), it updates its internal allocation state,
   issues certain calls to CRI to bind mount the host directories where
   user-space libraries and tools are installed to the device-specific
   default directories in user Pod or set up certain environment variables,
   and checkpoints the container-to-device allocation information to persistent
   storage.
 6. When user container accessing the device finishes, Kubelet updates its
   internal state to deallocate the device, and updates its checkpoint state
   in persistent storage.
 7. When the device plugin DaemonSet is removed, it cleans up device state
   (e.g., uninstalls the device driver, removes user-space libraries and
   tools). This step can be specified as a preStop
   [container lifecycle step](https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/).
   Note that one implication of this approach is that the device plugin upgrade
   process will be disruptive. More thought is needed if we want to
   support a non-disruptive upgrade process.
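
Steps 3 and 4 of the lifecycle above (an initial report, then a new report only when device state changes) can be sketched in Go as follows. The `DeviceInfo` struct, the `changed` helper, and `reportOnChange` are hypothetical names chosen for this example; a real plugin would poll its hardware and call the Kubelet gRPC server rather than iterate over a prepared slice.

```go
package main

import "fmt"

// DeviceInfo is a simplified, hypothetical stand-in for the reported message.
type DeviceInfo struct {
	ID    string
	State string // e.g. "HEALTHY", "UNHEALTHY", "UNKNOWN"
}

// changed reports whether two device lists differ in membership or state.
// Device IDs are assumed to be unique within a list.
func changed(prev, cur []DeviceInfo) bool {
	if len(prev) != len(cur) {
		return true
	}
	states := make(map[string]string, len(prev))
	for _, d := range prev {
		states[d.ID] = d.State
	}
	for _, d := range cur {
		if s, ok := states[d.ID]; !ok || s != d.State {
			return true
		}
	}
	return false
}

// reportOnChange walks a sequence of poll results, always sends the first
// one, and afterwards sends a result only when it differs from the previous
// poll. It returns the number of reports sent.
func reportOnChange(polls [][]DeviceInfo, report func([]DeviceInfo)) int {
	var last []DeviceInfo
	sent := 0
	for i, cur := range polls {
		if i == 0 || changed(last, cur) {
			report(cur)
			sent++
		}
		last = cur
	}
	return sent
}

func main() {
	polls := [][]DeviceInfo{
		{{ID: "GPU-0", State: "HEALTHY"}},
		{{ID: "GPU-0", State: "HEALTHY"}},
		{{ID: "GPU-0", State: "UNHEALTHY"}},
	}
	sent := reportOnChange(polls, func(devs []DeviceInfo) {
		fmt.Println("report:", devs)
	})
	fmt.Println("reports sent:", sent)
}
```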
 ## Protobuf API
 ```protobuf
 service PluginResource {
 	rpc Register(RegisterRequest) returns (RegisterResponse) {}
 	rpc ReportDeviceInfo(ReportRequest) returns (ReportResponse) {}
 }
 message RegisterRequest {
 	// Version of the API the Device Plugin was built against
 	string version = 1;
 	// E.g., "nvidia-gpu". Used to construct OIR:
 	// “extensions.kubernetes.io/resourcename”.
 	string resourcename = 2;
 }
 message RegisterResponse {
 	bool success = 1;
 	// Kubelet fills this field with details if it encounters any errors
 	// during the registration process, e.g., for version mismatch, what
 	// is the required version and minimum supported version by kubelet.
 	string error = 2;
 }
 message ReportRequest {
 	repeated DeviceInfo devices = 1;
 }
 message DeviceInfo {
 	// Health state of a device.
 	enum State {
 		UNKNOWN = 0;
 		HEALTHY = 1;
 		UNHEALTHY = 2;
 	}
 	// E.g., "GPU-fef8089b-4820-abfc-e83e-94318197576e".
 	// Needs to be unique per device plugin.
 	string Id = 1;
 	// E.g., UNKNOWN, HEALTHY, UNHEALTHY.
 	State state = 2;
 	// E.g., {"/rootfs/nvidia":"/usr/local/nvidia"}
 	// Maps from host directory where device library or tools
 	// are installed to user pod directory where the library or
 	// tools are expected to be accessed. Kubelet will use this
 	// information to bind mount host directory to the user pod
 	// directory during allocation.
 	map<string, string> mountpaths = 3;
 	// E.g., {"LD_PRELOAD":"xxx.so"}. Kubelet will export these
 	// env variables in user pod during allocation.
 	map<string, string> envariables = 4;
 	// E.g., {"Family":"Pascal", "ECC":"True"}
 	// These fields can be used as node labels for selection
 	map<string, string> labels = 5;
 }
 message ReportResponse {
 	bool success = 1;
 	// Kubelet fills this field if it encounters any errors
 	// during the report process, e.g., device plugin hasn’t
 	// registered yet (could happen when Kubelet restarts).
 	string error = 2;
 }
 ```
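
To illustrate how Kubelet might handle the two calls in the API above, here is a minimal Go sketch with no gRPC involved. The `kubeletStub` type, the `supportedVersion` value `v1alpha1`, and the resource name validation rule are assumptions for this example; only the `extensions.kubernetes.io/` prefix and the success/error response shape come from the proposal.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	resourcePrefix   = "extensions.kubernetes.io/"
	supportedVersion = "v1alpha1" // hypothetical version string
)

// Plain Go mirrors of the messages above (not generated gRPC code).
type RegisterRequest struct {
	Version      string
	ResourceName string
}

type RegisterResponse struct {
	Success bool
	Error   string
}

type ReportResponse struct {
	Success bool
	Error   string
}

// kubeletStub is a hypothetical stand-in for the Kubelet side of the service.
type kubeletStub struct {
	registered map[string]bool // keyed by full resource name
}

// Register rejects unknown versions and malformed names, and otherwise
// records the plugin under the prefixed resource name.
func (k *kubeletStub) Register(req RegisterRequest) RegisterResponse {
	if req.Version != supportedVersion {
		return RegisterResponse{Error: fmt.Sprintf(
			"version %q not supported, kubelet requires %q",
			req.Version, supportedVersion)}
	}
	if req.ResourceName == "" || strings.Contains(req.ResourceName, "/") {
		return RegisterResponse{Error: "invalid resource name"}
	}
	k.registered[resourcePrefix+req.ResourceName] = true
	return RegisterResponse{Success: true}
}

// ReportDeviceInfo fails if the plugin has not registered yet, which can
// happen after a Kubelet restart.
func (k *kubeletStub) ReportDeviceInfo(resourceName string, deviceIDs []string) ReportResponse {
	if !k.registered[resourcePrefix+resourceName] {
		return ReportResponse{Error: "device plugin has not registered yet"}
	}
	return ReportResponse{Success: true}
}

func main() {
	k := &kubeletStub{registered: map[string]bool{}}
	fmt.Println(k.ReportDeviceInfo("nvidia-gpu", []string{"GPU-0"}))
	fmt.Println(k.Register(RegisterRequest{Version: "v1alpha1", ResourceName: "nvidia-gpu"}))
	fmt.Println(k.ReportDeviceInfo("nvidia-gpu", []string{"GPU-0"}))
}
```

A plugin that receives the "has not registered yet" error would be expected to call `Register` again before retrying its report.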
 ## Failure recovery
 * Device failure: Device plugin should be able to detect device failure and
  report that to Kubelet. Kubelet should then remove the failed device from
  available list. If there is any user container using the failed device,
  Kubelet may terminate the user container and reschedule it on a good
  available device. When a failed device recovers, device plugin will send
  Kubelet the updated device state and Kubelet can add the device to the
  available device list.
 * Kubelet crash: When Kubelet restarts after a crash, it should be able to
  recover allocation states from the checkpoints recorded on persistent storage.
  The checkpoint records should include allocated device id to pod mapping
  information as well as non-allocated device information, so Kubelet can
  re-establish precise allocation state. Device plugin should be able to
  detect Kubelet failure when it needs to update device information,
  and re-register with the new Kubelet.
 * Device plugin crash: A device plugin is deployed through DaemonSets.
  If a device plugin process fails, Kubelet will detect that and automatically
  restart it. After restart, device plugin will reconnect to Kubelet and
  report the current device states. Kubelet can compare the reported device
  information with its internal device states, and make adjustments if
  necessary. One thing we need to pay special attention to is that device plugin
  may fail at any time, e.g., during initialization. When the new device plugin
  process starts, it needs to be able to recover from incomplete states.
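
The Kubelet crash case relies on a checkpoint that records both allocated and non-allocated devices. A minimal sketch of such a record, assuming JSON as the on-disk format, might look like this in Go. The `Checkpoint` type and its field names are hypothetical, and writing the bytes to persistent storage is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Checkpoint is a hypothetical record of device allocation state.
type Checkpoint struct {
	Allocated map[string]string `json:"allocated"` // device ID -> pod/container
	Free      []string          `json:"free"`      // non-allocated device IDs
}

// Allocate assigns the first free device to the given pod.
func (c *Checkpoint) Allocate(pod string) (string, bool) {
	if len(c.Free) == 0 {
		return "", false
	}
	id := c.Free[0]
	c.Free = c.Free[1:]
	c.Allocated[id] = pod
	return id, true
}

// Save serializes the checkpoint; the caller would write it to disk.
func (c *Checkpoint) Save() ([]byte, error) {
	return json.Marshal(c)
}

// Restore rebuilds the allocation state from a saved checkpoint.
func Restore(data []byte) (*Checkpoint, error) {
	c := &Checkpoint{}
	if err := json.Unmarshal(data, c); err != nil {
		return nil, err
	}
	if c.Allocated == nil {
		c.Allocated = map[string]string{}
	}
	return c, nil
}

func main() {
	c := &Checkpoint{
		Allocated: map[string]string{},
		Free:      []string{"GPU-0", "GPU-1"},
	}
	id, _ := c.Allocate("pod-a")
	data, _ := c.Save()

	// Simulate a Kubelet restart: state is rebuilt from the saved bytes only.
	restored, err := Restore(data)
	fmt.Println(id, restored.Allocated, restored.Free, err)
}
```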
 ## Roadmap
 * Phase 1: device plugin is supported in alpha mode in 1.8 kubernetes release.
  Make sure it provides the following functionalities: initialize, discover,
  allocate/deallocate, cleanup, basic health check, and can recover from device,
  Kubelet or device plugin failures. Make sure the interface is kept simple and
  extensible, and the document is clear. With the initial implemented API,
  make sure we can use the interface to implement device plugin images for at
  least two types of devices: Nvidia GPU and Solarflare NIC. Note the support
  for Nvidia GPU will help GPU support to enter beta by providing a general
  and extendable API. Test coverage: e2e tests with the developed Nvidia GPU
  image and Solarflare image to make sure these devices can be correctly
  initialized, allocated, deallocated, and cleaned up. Also should test we
  can recover from device failure, Kubelet restarts, and device plugin failure.
 * Phase 2: device plugin is supported in beta mode in 1.9 kubernetes release.
  At this phase, the primary design and API should be stabilized. We need to
  implement authentication mechanism to ensure only trusted device plugin
  images can be registered. We can support device specific
  allocation/deallocation requests by having device plugin also provide a gRPC
  service or Device Plugin can instruct Kubelet to perform device specific
  operation hooks during allocation/deallocation procedures.
  Hopefully at this time, we may make good progress on supporting more flexible
  resource allocation policies in Kubernetes, and with that, we can switch
  device plugin from using OIR to using ResourceClass to allow more efficient
  HW specific resource allocations, e.g., topology aware resource allocations,
  NUMA aware resource allocations etc.
 * Phase 3: device plugin is supported in GA mode in 1.10 kubernetes release.
  Device plugin should have clear error handling and problem reporting that
  allow easy debugging and monitoring of its exported devices.
  We should have clear documentation on how to develop a device plugin and
  how to interact with a device plugin. The framework needs to be stable and
  demonstrate good user experiences through the support of multiple types
  of devices, such as GPU, InfiniBand, and high-performance NIC.
 ## Open Questions
 * The proposal assumes we can omit device specific allocation/deallocation
  operations in the alpha release and support this feature in later releases.
  If people are concerned that such omission would impact the usability of
  the alpha release, we will need a solution that either uses two-way gRPC
  communication between Kubelet and Device Plugin or lets Device Plugin
  instruct Kubelet to perform device specific operation hooks during
  allocation/deallocation procedures.
 # Installation
 The installation process should be straightforward to the user, transparent
 and similar to other regular Kubernetes actions.
@@ -366,6 +573,7 @@ as `kubeadm` they would use the examples that we would provide at:
 `https://github.com/Kubernetes/Kubernetes/tree/master/examples/device-plugin.yaml`
 YAML example:
 ```yaml
 apiVersion: extensions/v1beta1
 kind: DaemonSet
@@ -388,48 +596,6 @@ spec:
                   path: /var/run/Kubernetes
 ```
 ## API Changes
 ### Device
 When discovering the devices, Kubelet will be in charge of advertising those
 resources to the API server.
 We will advertise each device returned by the Device Plugin in a new structure
 called `Device`.
 It is defined as follows:
 ```golang
 type Device struct {
 	Kind       string
 	Vendor     string
 	Name       string
 	Health     DeviceHealthStatus
 	Properties map[string]string
 }
 ```
 Because the current API (Capacity) can not be extended to support Device,
 we will need to create two new attributes in the NodeStatus structure:
  * `DevCapacity`: Describing the device capacity of the node
  * `DevAvailable`: Describing the available devices
 ```golang
 type NodeStatus struct {
 	DevCapacity []Device
 	DevAvailable []Device
 }
 ```
 We also introduce the `Allocated` field in the pod's status so that user
 can know what devices were assigned to the pod. It could also be useful in
 the case of monitoring
 ```golang
 type ContainerStatus struct {
 	Devices []Device
 }
 ```
 # Versioning
 Currently there is only one part (CRI) of Kubernetes which is based on
@@ -469,3 +635,12 @@ Negotiation would take place in the registration:
   contacts the Device Plugin
 4. If the Device Plugin supports the version sent by Kubelet it can and should
   answer the different calls made by Kubelet
 ## References
  * [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
  * [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
  * [Kubernetes Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
  * [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
  * [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)