Addressed comments, added protocol overview, explained impl differences

commit f1b462b123 (parent 4cbc77e491)
@@ -3,30 +3,31 @@ Device Manager Proposal

<!-- BEGIN MUNGE: GENERATED_TOC -->

* [Motivation](#motivation)
  * [Use Cases](#use-cases)
  * [Objectives](#objectives)
  * [Non Objectives](#non-objectives)
* [Proposed Implementation 1](#proposed-implementation-1)
  * [Vendor story](#vendor-story)
  * [End User story](#end-user-story)
  * [Device Plugin](#device-plugin)
    * [Introduction](#introduction)
    * [Registration](#registration)
    * [Unix Socket](#unix-socket)
    * [Protocol Overview](#protocol-overview)
    * [Protobuf specification](#protobuf-specification)
    * [HealthCheck and Failure Recovery](#healthcheck-and-failure-recovery)
    * [API Changes](#api-changes)
  * [Upgrading your cluster](#upgrading-your-cluster)
* [Proposed Implementation 2](#proposed-implementation-2)
  * [Device Plugin Lifecycle](#device-plugin-lifecycle)
  * [Protobuf API](#protobuf-api)
  * [Failure recovery](#failure-recovery)
  * [Roadmap](#roadmap)
  * [Open Questions](#open-questions-1)
* [Installation](#installation)
* [Versioning](#versioning)
* [References](#references)

<!-- END MUNGE: GENERATED_TOC -->
_Authors:_
@@ -48,7 +49,7 @@ This document describes a vendor independent solution to:

* Discovering and representing external devices
* Making these devices available to the containers requesting them and
  cleaning them up afterwards
* Health Check of these devices

Because devices are vendor dependent and have their own sets of problems
and mechanisms, the solution we describe is a plugin mechanism that may run
@@ -85,33 +86,43 @@ the following simple steps:

1. Advanced scheduling and resource selection (solved through
   [#782](https://github.com/Kubernetes/community/pull/782)).
2. Collecting metrics is not part of this proposal. We will only solve
   Health Check.
# Proposed Implementation 1

## TLDR

At their core, device plugins are simple gRPC servers that may run in a
container deployed through the pod mechanism.

These servers implement the gRPC interface defined later in this design
document and once the device plugin makes itself known to kubelet, kubelet
will interact with the device through three simple functions:
  1. A `ListDevices` function for the kubelet to Discover the devices and
     their properties.
  2. An `Allocate` function which is called before container creation
  3. A `HealthCheck` function to notify Kubelet whenever a device becomes
     unhealthy.

![Process](device-plugin-overview.png)
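The three calls can be modeled as a plain Go interface. This is a hypothetical sketch rather than the generated gRPC stubs; the `FakePlugin` type and its device data are illustrative assumptions:

```go
package main

import "fmt"

// Device mirrors the protobuf Device message described later in this proposal.
type Device struct {
	Kind, Name, Health string
	Properties         map[string]string
}

// DevicePlugin models the three calls Kubelet makes against a plugin.
type DevicePlugin interface {
	ListDevices() []Device         // discovery of devices and their properties
	Allocate(names []string) error // invoked before container creation
	HealthCheck() <-chan Device    // stream of devices whose health changed
}

// FakePlugin is an in-memory stand-in used only to illustrate the flow.
type FakePlugin struct {
	devices []Device
	health  chan Device
}

func (p *FakePlugin) ListDevices() []Device      { return p.devices }
func (p *FakePlugin) HealthCheck() <-chan Device { return p.health }

func (p *FakePlugin) Allocate(names []string) error {
	for _, n := range names {
		found := false
		for _, d := range p.devices {
			if d.Name == n && d.Health == "Healthy" {
				found = true
			}
		}
		if !found {
			return fmt.Errorf("no healthy device named %q", n)
		}
	}
	return nil
}

func main() {
	p := &FakePlugin{
		devices: []Device{{Kind: "nvidia-gpu", Name: "GPU-0", Health: "Healthy"}},
		health:  make(chan Device, 1),
	}
	fmt.Println(len(p.ListDevices()), p.Allocate([]string{"GPU-0"}) == nil)
}
```

A real plugin would expose these methods over the gRPC service defined in the protobuf specification instead of a Go interface.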
## Vendor story

Kubernetes provides to vendors a mechanism called device plugins to:
* advertise devices.
* monitor devices (currently perform health checks).
* hook into the runtime to execute device specific instructions
  (e.g: Clean GPU memory) and instruct Kubelet what are the steps
  to take in order to make the device available in the container.
```go
service DevicePlugin {
  rpc ListDevices(Empty) returns (stream Device) {}
  rpc HealthCheck(Empty) returns (stream Device) {}

  rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}
```

The gRPC server that the device plugin must implement is expected to
@@ -120,44 +131,44 @@ be advertised on a unix socket in a mounted hostPath (e.g:

Finally, to notify Kubelet of the existence of the device plugin,
the vendor's device plugin will have to make a request to Kubelet's
own gRPC server.
Only then will kubelet start interacting with the vendor's device plugin
through the gRPC apis.
## End User story

When setting up the cluster the admin knows what kind of devices are present
on the different machines and therefore can select what devices they want to
enable.

The cluster admin knows the cluster has NVIDIA GPUs, therefore they deploy
the NVIDIA device plugin through:
`kubectl create -f nvidia.io/device-plugin.yml`
The device plugin lands on all the nodes of the cluster and if it detects that
there are no GPUs it terminates. However, when there are GPUs it reports them
to Kubelet and starts its gRPC server to monitor devices and hook into the
container creation process.

Device Plugins reporting non-GPU Devices are advertised as OIRs of the shape
`extensions.kubernetes.io/vendor-device`; GPUs are advertised as `nvidia-gpu`.
Devices can be selected using the same process as for OIRs in the pod spec.

1. A user submits a pod spec requesting X GPUs (or devices) through OIR
2. The scheduler filters the nodes which do not match the resource requests
3. The pod lands on the node and Kubelet decides which device
   should be assigned to the pod
4. Kubelet calls `Allocate` on the matching Device Plugins
5. The user deletes the pod or the pod terminates
When receiving a pod which requests Devices kubelet is in charge of:
* deciding which device to assign to the pod's containers
  * Note: This will be decided in the future at the scheduler level as
    part of the Resource Class proposal
* Calling the `Allocate` function with the list of devices

The scheduler is still in charge of filtering the nodes which cannot
satisfy the resource requests.
## Device Plugin

@@ -165,13 +176,16 @@ He might in the future be in charge of selecting the device.

The device plugin is structured in 5 parts:
1. Registration: The device plugin advertises its presence to Kubelet
2. ListDevices: Kubelet calls the device plugin to list its devices
3. HealthCheck: The device plugin returns a stream on which it writes when
   a device's health changes
4. Allocate: When creating containers, Kubelet calls the device plugin's
   `Allocate` function so that it can run device specific instructions (gpu
   cleanup, QRNG initialization, ...) and instruct Kubelet how to make the
   device available in the container.
5. Heartbeat: The device plugin polls Kubelet every 5s to know if it's still
   alive and if it has to re-issue a Register request (e.g: Kubelet crashed
   between two heartbeats)
### Registration

@@ -183,7 +197,7 @@ sockets and follow this simple pattern:

1. The device plugin starts its gRPC server
2. The device plugin sends a `RegisterRequest` to Kubelet (through a
   gRPC request)
4. Kubelet starts its Discovery phase and calls `ListDevices` and `HealthCheck`
5. Kubelet answers to the `RegisterRequest` with a `RegisterResponse`
   containing any error Kubelet might have encountered
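The request/response exchange in steps 2 and 5 can be sketched in Go. This is a minimal sketch: the trimmed-down message fields and the version-mismatch error are assumptions modeled on the protobuf specification, not the real generated types:

```go
package main

import "fmt"

// RegisterRequest/RegisterResponse are trimmed-down stand-ins for the
// protobuf messages in this proposal (illustrative only).
type RegisterRequest struct {
	Version    string
	Unixsocket string
	Vendor     string
}

type RegisterResponse struct {
	Version string
	Error   string
}

// register models Kubelet's side of the handshake: it answers the plugin's
// RegisterRequest with a RegisterResponse carrying any error it encountered,
// such as an API version mismatch.
func register(req RegisterRequest, kubeletVersion string) RegisterResponse {
	if req.Version != kubeletVersion {
		return RegisterResponse{
			Version: kubeletVersion,
			Error:   fmt.Sprintf("version mismatch: plugin %s, kubelet %s", req.Version, kubeletVersion),
		}
	}
	return RegisterResponse{Version: kubeletVersion}
}

func main() {
	resp := register(RegisterRequest{Version: "v1alpha1", Vendor: "nvidia"}, "v1alpha1")
	fmt.Println(resp.Error == "")
}
```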
@@ -192,7 +206,7 @@ sockets and follow this simple pattern:

Device Plugins are expected to communicate with Kubelet through gRPC
on a Unix socket.
When starting the gRPC server, they are expected to create a unix socket
at the following host path: `/var/run/kubernetes`.

For non bare metal device plugins this means they will have to mount the folder
as a volume in their pod spec ([see Installation](#installation)).
@@ -217,12 +231,10 @@ not there was an error. The errors may include (but not limited to):

* Vendor is not consistent across discovered devices

Kubelet will then interact with the plugin through the following functions:
* `ListDevices`: List Devices
* `HealthCheck`: Returns a stream that is written to when a Device becomes
  unhealthy
* `Allocate`: Called when creating a container with a list of devices

The device plugin is also expected to periodically call the `Heartbeat` function
exposed by Kubelet and issue a `Registration` request when it either can't reach
@@ -240,11 +252,10 @@ service PluginRegistration {

```go
}

service DevicePlugin {
  rpc ListDevices(Empty) returns (stream Device) {}
  rpc HealthCheck(Empty) returns (stream Device) {}

  rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}

message RegisterRequest {
```

@@ -262,7 +273,7 @@ message RegisterResponse {

```go
  string version = 1;
  // Kubelet fills this field if it encounters any errors
  // during the registration process or discover process
  string error = 2;
}

message HeartbeatRequest {
```

@@ -274,7 +285,7 @@ message HeartbeatResponse {

```go
  // plugin to either re-register itself or not
  string response = 1;
  // Kubelet fills this field if it encountered any errors
  string error = 2;
}

message AllocateRequest {
```

@@ -288,26 +299,17 @@ message AllocateResponse {

```go
  repeated Mount mounts = 2;
}

// E.g:
// struct Device {
//   Kind: "NVIDIA-gpu"
//   Name: "GPU-fef8089b-4820-abfc-e83e-94318197576e",
//   Health: "Healthy",
//   Properties: {
//     "Family": "Pascal",
//     "Memory": "4G",
//     "ECC" : "True",
//   }
// }
message Device {
  string Kind = 1;
  string Name = 2;
```

@@ -315,15 +317,161 @@ message Device {

```go
  string Vendor = 4;
  map<string, string> properties = 5; // Could be [1, 1.2, 1G]
}
```
### HealthCheck and Failure Recovery

We want Kubelet as well as the Device Plugins to recover from failures
that may happen on any side of this protocol.

At the communication level, gRPC is a very strong piece of software and
is able to ensure that if failure happens it will try its best to recover
through exponential backoff reconnection.

The proposed mechanism intends to replace any device specific handling in
Kubelet. Therefore in general, device plugin failure or upgrade means that
Kubelet is not able to accept any pod requesting a Device until the upgrade
or failure finishes.

If a device fails, the Device Plugin should signal that through the HealthCheck
stream and we expect Kubelet to stop the pod and reschedule it.

If any Device Plugin fails the behavior we expect depends on the task Kubelet
is performing:
* In general we expect Kubelet to remove any devices that are owned by the failed
  device plugin from the resources advertised by the Node status.
* We however do not expect Kubelet to fail or restart any pods or containers
  running that are using these devices.
* If Kubelet is in the process of allocating a device, then it should fail
  the container process and reschedule the Pod.

If the Kubelet fails or restarts, we expect the Device Plugins to know about
it through Kubelet's Heartbeat call, which every Device Plugin should call
every 5s.

When Kubelet fails or restarts it should know what are the devices that are
owned by the different containers and be able to rebuild a list of available
devices.
In the current design, instead of checkpointing this data, we propose to save
this in the API server as this gives introspection capabilities to the user,
has minimal impact on performances and is a minimal change that can be
reverted if we decide to implement checkpointing or a debug API later.

If Kubelet failed and recovered between two Heartbeats we expect it
to answer with a HeartbeatKo answer, signaling the device plugins to register
themselves again against the Kubelet (in case of heartbeat failure
or connection error).
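Rebuilding the list of available devices after a Kubelet restart reduces to a set difference: everything the plugins advertise minus what the API server records as already owned by containers. A hypothetical sketch (function and parameter names are assumptions, not part of the proposal):

```go
package main

import "fmt"

// availableDevices returns the devices a restarted Kubelet may still hand
// out: those advertised by the plugins that no container owns according to
// the state saved in the API server.
func availableDevices(advertised, owned []string) []string {
	ownedSet := map[string]bool{}
	for _, d := range owned {
		ownedSet[d] = true
	}
	var free []string
	for _, d := range advertised {
		if !ownedSet[d] {
			free = append(free, d)
		}
	}
	return free
}

func main() {
	free := availableDevices(
		[]string{"GPU-0", "GPU-1", "GPU-2"}, // reported by ListDevices
		[]string{"GPU-1"},                   // recorded in container statuses
	)
	fmt.Println(free) // [GPU-0 GPU-2]
}
```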
### API Changes

When discovering the devices, Kubelet will be in charge of advertising those
resources to the API server as part of the current kubelet node update protocol.

We will advertise each device returned by the Device Plugin in a new structure
called `Device`.
It is defined as follows:

```golang
// E.g:
// struct Device {
//   Kind: "NVIDIA-gpu"
//   Name: "GPU-fef8089b-4820-abfc-e83e-94318197576e"
//   Health: "Healthy",
//   Properties: {
//     "Family": "Pascal",
//     "Memory": "4G",
//     "ECC" : "True",
//   }
// }
type Device struct {
	Kind       string
	Vendor     string
	Name       string
	Health     DeviceHealthStatus
	Properties map[string]string
}
```

Because the current API (Capacity) can not be extended to support Device,
we will need to create one new attribute in the NodeStatus structure:
* `DevCapacity`: Describing the device capacity of the node

```golang
type NodeStatus struct {
	DevCapacity []Device
}
```

We also introduce the `Devices` field in the Containers status so that users
can know what devices were assigned to the container.

```golang
type ContainerStatus struct {
	Devices []Device
}
```

Note that we will be using OIR to schedule and trigger the device plugin
in parallel.
So when a Device plugin registers two `foo-device` the node status will be
updated to advertise 2 `extensions.kubernetes.io/foo-device`.

If a user wants to trigger the device plugin they only need to request this
OIR in their Pod Spec.
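The OIR advertisement described above is a simple counting step. A sketch under the assumption that devices are grouped by their `Kind` (the function name is illustrative):

```go
package main

import "fmt"

// advertiseOIR turns the device kinds a plugin registered into the OIR
// resource counts the node status would advertise, e.g. two "foo-device"
// become 2 x extensions.kubernetes.io/foo-device.
func advertiseOIR(deviceKinds []string) map[string]int {
	oir := map[string]int{}
	for _, kind := range deviceKinds {
		oir["extensions.kubernetes.io/"+kind]++
	}
	return oir
}

func main() {
	counts := advertiseOIR([]string{"foo-device", "foo-device", "nvidia-gpu"})
	fmt.Println(counts["extensions.kubernetes.io/foo-device"]) // 2
}
```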
## Upgrading your cluster

TLDR: If you are upgrading either Kubelet or any device plugin the safest way
is to drain the node of all pods and upgrade.
However, depending on what you are upgrading and what changes happened, it
is completely possible to only restart just Kubelet or just the device plugin.

### Upgrading Kubelet

This assumes that the Device Plugins running on the nodes fully implement the
protocol and are able to recover from a Kubelet crash.

Then, as long as the Device Plugin API does not change, upgrading Kubelet can be done
seamlessly through a Kubelet restart.

However, as mentioned in the Versioning section, we currently expect the Device
Plugin's API version to match exactly the Kubelet's Device Plugin API version.

Therefore if the Device Plugin API version changes then you will have to change
the Device Plugin too.
Consider draining the node in that case.

When the Device Plugin API becomes a stable feature, versioning should be
backward compatible and even if Kubelet has a different Device Plugin API,
it should not require a Device Plugin upgrade.

### Upgrading Device Plugins

Because we cannot enforce what the different Device Plugins will do, we cannot
say for certain that upgrading a device plugin will not crash any containers
on the node.

It is therefore up to the Device Plugin vendors to specify if the Device Plugins
can be upgraded without impacting any running containers.

As mentioned earlier, the safest way is to drain the node before upgrading
the Device Plugins.
## Difference Between Implementations

The main differences between implementation 1 and 2 are:
* This implementation allows vendors to run device specific code before
  starting the containers requesting these devices.
* This implementation allows users to know what devices were assigned
  to a container
* This implementation does not need checkpointing
* This implementation has a clear separation of concerns: every function
  does one thing and one thing only. Every actor has only one explicit role:
  * Kubelet's gRPC is in charge of keeping track of Device Plugins
  * The Device Plugin's gRPC is in charge of handling devices
# Proposed Implementation 2

The main strategy of this proposed implementation is that we want to start with
@@ -636,7 +784,7 @@ Negotiation would take place in the registration:

4. If the Device Plugin supports the version sent by Kubelet it can and should
   answer the different calls made by Kubelet

# References

* [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
* [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)