Device plugin proposal patch by Jiaying
@@ -1,99 +1,77 @@

# Device Manager Proposal
<!-- BEGIN MUNGE: GENERATED_TOC -->

- [Abstract](#abstract)
- [Motivation](#motivation)
- [Use Cases](#use-cases)
- [Objectives](#objectives)
- [Non Objectives](#non-objectives)
- [Proposed Implementation 1](#proposed-implementation-1)
  - [Vendor story](#vendor-story)
  - [End User story](#end-user-story)
  - [Device Plugin](#device-plugin)
    - [Introduction](#introduction)
    - [Registration](#registration)
    - [Unix Socket](#unix-socket)
    - [Protocol Overview](#protocol-overview)
    - [Protobuf specification](#protobuf-specification)
- [Proposed Implementation 2](#proposed-implementation-2)
  - [Device Plugin Lifecycle](#device-plugin-lifecycle)
  - [Protobuf API](#protobuf-api)
  - [Failure recovery](#failure-recovery)
  - [Roadmap](#roadmap)
  - [Open Questions](#open-questions-1)
- [Installation](#installation)
- [API Changes](#api-changes)
- [Versioning](#versioning)
- [References](#references)

<!-- END MUNGE: GENERATED_TOC -->

_Authors:_

* @RenaudWasTaken - Renaud Gaubert <rgaubert@NVIDIA.com>

# Abstract

This document describes a vendor independent solution to:
* Discovering and representing external devices
* Making these devices available to the container and cleaning them up
  afterwards
* Health Check of these devices

Because devices are vendor dependent and have their own sets of problems
and mechanisms, the solution we describe is a plugin mechanism managed by
Kubelet.

At their core, device plugins are simple gRPC servers that may run in a
container deployed through the pod mechanism.

These servers implement the gRPC interface defined later in this design
document and, once the device plugin makes itself known to Kubelet, Kubelet
will interact with the device plugin through three simple functions:
1. A `Discover` function for the Kubelet to discover the devices and
   their properties.
2. An `Allocate` and `Deallocate` function which are called respectively
   before container creation and after container deletion with the
   devices to allocate and deallocate.
3. A `Monitor` function to notify Kubelet whenever a device becomes
   unhealthy.

The goal is for a user to be able to enable vendor devices (e.g.: GPUs) through
the following simple steps:
* `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
* When launching `kubectl describe nodes`, the devices appear in the node spec
* In the long term users will be able to select them through Resource Class

We expect the plugins to be deployed across the clusters through DaemonSets.
The targeted devices are GPUs, NICs, FPGAs, InfiniBand, Storage devices, ....

# Motivation

Kubernetes currently supports discovery of CPU and Memory only to a
minimal extent. Very few devices are handled natively by Kubelet.

It is not a sustainable solution to expect every vendor to add their vendor
specific code inside Kubernetes to make their devices usable. This approach
does not scale and is not portable.
Instead, we want a solution for vendors to be able to advertise their resources
to Kubelet and monitor them without writing custom Kubernetes code.
We also want to provide a consistent and portable solution for users to
consume hardware devices across k8s clusters, and a way for users to specify
which resources their jobs will use and what constraints are associated with
these resources.

To solve this problem, we need a plugin system that lets vendors advertise
and monitor their resources on behalf of Kubelet.
Because devices are vendor dependent and have their own sets of problems
and mechanisms, the solution we describe is a plugin mechanism that may run
in a container deployed through the DaemonSets mechanism.

The targeted devices include GPUs, High-performance NICs, FPGAs, InfiniBand,
Storage devices, and other similar computing resources that require vendor
specific initialization and setup.

Additionally, we introduce the concept of a Device to be able to select
resources with constraints in a pod spec.

_GPU Integration Example:_
* [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
* [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)

_Kubernetes Meeting Notes On This:_
* [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
* [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)

# Use Cases

* I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
  in my pod.
* I should be able to use that device without writing custom Kubernetes code.
* I want a consistent and portable solution to consume hardware devices
  across k8s clusters.

# Objectives

1. Add support for vendor specific Devices in kubelet:
   * Through a pluggable mechanism.

@@ -103,16 +81,18 @@ _Kubernetes Meeting Notes On This:_

2. Define a deployment mechanism for this new API.
3. Define a versioning mechanism for this new API.

# Non Objectives

1. Advanced scheduling and resource selection (solved through
   [#782](https://github.com/Kubernetes/community/pull/782)).
   We will only try to give basic selection primitives to the devices.
2. Metrics: this should be the job of cAdvisor and should probably either be
   addressed there (cAdvisor) or, if people feel there is a case to be made
   for it being addressed in the Device Plugin, in a follow up proposal.

# Proposed Implementation 1

## Vendor story

Kubernetes provides vendors a mechanism called device plugins to:
* advertise devices.

@@ -144,7 +124,7 @@ own gRPC server.

Only then will kubelet start interacting with the vendor's device plugin
through the gRPC APIs.

## End User story

When setting up the cluster the admin knows what kind of devices are present
on the different machines and therefore can select what devices they want to

@@ -182,6 +162,7 @@ He might in the future be in charge of selecting the device.

## Device Plugin

### Introduction

The device plugin is structured in 5 parts:
1. Registration: The device plugin advertises its presence to Kubelet
2. Discovery: Kubelet calls the device plugin to list its devices
@@ -189,7 +170,7 @@ The device plugin is structured in 5 parts:
   devices advertised by the device plugin, Kubelet calls the device plugin's
   `Allocate` and `Deallocate` functions.
4. Cleanup: Kubelet terminates the communication through a "Stop"
5. Heartbeat: The device plugin polls Kubelet to know if it's still alive
   and if it has to re-issue a Register request

### Registration

@@ -247,7 +228,7 @@ The device plugin is also expected to periodically call the `Heartbeat` function
exposed by Kubelet and issue a `Registration` request when it either can't reach
Kubelet or Kubelet answers with a `KO` response.
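
As a rough sketch of that polling loop, the Go code below uses a `KubeletClient`
interface as a stand-in for whatever stub the (partially elided) protobuf
specification generates; the method shapes, the `KO` string handling, and the
fake client are illustrative assumptions, not part of the specification.

```go
package main

import (
	"log"
	"time"
)

// KubeletClient is a hypothetical stand-in for the gRPC stub generated
// from the device plugin protobuf specification in this proposal.
type KubeletClient interface {
	// Heartbeat returns "OK" or "KO" from Kubelet.
	Heartbeat() (string, error)
	// Register re-advertises the device plugin to Kubelet.
	Register() error
}

// heartbeatLoop periodically calls Kubelet's Heartbeat function and
// re-issues a Registration request when Kubelet is unreachable or
// answers with a KO response, as described above.
func heartbeatLoop(c KubeletClient, interval time.Duration) {
	for {
		resp, err := c.Heartbeat()
		if err != nil || resp == "KO" {
			log.Printf("heartbeat failed (resp=%q err=%v), re-registering", resp, err)
			if regErr := c.Register(); regErr != nil {
				log.Printf("re-registration failed: %v", regErr)
			}
		}
		time.Sleep(interval)
	}
}

// fakeClient simulates a Kubelet that answers KO once, then OK.
type fakeClient struct{ calls int }

func (f *fakeClient) Heartbeat() (string, error) {
	f.calls++
	if f.calls == 1 {
		return "KO", nil
	}
	return "OK", nil
}

func (f *fakeClient) Register() error { return nil }

func main() {
	go heartbeatLoop(&fakeClient{}, 100*time.Millisecond)
	time.Sleep(350 * time.Millisecond)
}
```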

![](device-plugin.png)

### Protobuf specification

@@ -343,7 +324,233 @@ message DeviceHealth {

# Proposed Implementation 2

The main strategy of this proposed implementation is to start with
something simple that shows benefits for our immediate use cases, while keeping
the API design extensible to support future requirements.
Here are the main motivations for this alternative proposed implementation:

* Discovery phase: can we eliminate this gRPC procedure? It seems more
  natural for the device plugin to send Kubelet the discovered device information
  right after device initialization and the registration gRPC procedure.
* The current implementation uses gRPC to communicate between Kubelet and
  the device plugin. Both Kubelet and the device plugin need to start a gRPC server
  for two-way communication. This seems a bit complicated. Can we have
  the device plugin send enough information to Kubelet so that only
  Kubelet needs to start a gRPC server and the device plugin is kept as a gRPC client?
  The main concern with one-way gRPC communication is that we can NOT
  support device specific operations, like resetting a device, during
  allocation/deallocation. Depending on how long we expect device specific
  operations to take, we can support this feature later by
  having the device plugin also provide a gRPC service, or the Device Plugin
  can instruct Kubelet to perform device specific operation hooks.
* Do we need checkpointing in the initial prototype implementation?
  Even in the alpha release, we may still want to be able to recover from
  various failure scenarios. Otherwise, it would affect user experience.
  Currently, it seems the only information we need to record somewhere
  is which device is allocated to which pod/container (see the sketch after
  this list). There have been
  discussions on different ways and places to record this information.
  The approach taken by the current implementation pushes this information
  to the ApiServer by extending the NodeStatus interface between Kubelet and ApiServer.
  The major concern with this approach is that it introduces an API extension
  that departs from the current model (currently, Node information recorded at the
  ApiServer only contains resource capacity information; resource allocation
  information is kept at the Node). The second approach is for Kubelet to checkpoint
  this information. This seems to align with the current Kubernetes model in which
  Kubelet is the component that implements allocation/deallocation functionality.
  The information we want to checkpoint, i.e., which device is allocated to which
  pod/container, also seems generic enough to be implemented at Kubelet.
  It may also allow other use cases outside the device plugin, e.g., CPU pinning.
  The third approach is to implement this in the device plugin. This way,
  the device plugin can also record any state information useful to its own
  failure recovery in checkpoints. One concern with this approach is that it
  may put more burden on vendors to implement their device plugin images.
  Surprises might happen in production if things were not implemented correctly.
  It also departs from the current model, as today Kubelet is the place
  where allocation/deallocation happens for other types of resources.
* Heartbeat: do we need this to make sure connections can be re-established
  between Kubelet and the device plugin? Can we reuse the keepalive feature of gRPC?
  Or, if Kubelet checkpoints device allocation state information, the device plugin
  may only need to detect Kubelet failure when it needs to update device
  information. Or can the device plugin send periodic device state updates
  (these may be needed anyway if we want to collect device usage stats)
  and use those to detect Kubelet failure or device plugin failure?
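
To make the checkpointing discussion concrete, here is a minimal sketch of what
Kubelet might persist, assuming a JSON file under the Kubelet data directory.
The type names, fields, and file path are hypothetical illustrations, not part
of this proposal.

```go
package main

import (
	"encoding/json"
	"os"
)

// DeviceAllocation records a single device-to-container binding.
// All names here are illustrative; the proposal does not define this format.
type DeviceAllocation struct {
	ResourceName string `json:"resourceName"` // e.g. "extensions.kubernetes.io/nvidia-gpu"
	DeviceID     string `json:"deviceID"`     // e.g. "GPU-fef8089b-..."
	PodUID       string `json:"podUID"`
	Container    string `json:"container"`
}

// Checkpoint is the state Kubelet would write after every allocation or
// deallocation, and read back after a restart.
type Checkpoint struct {
	Allocations []DeviceAllocation `json:"allocations"`
	// Devices that are registered but currently unallocated, so the
	// full allocation state can be rebuilt after a Kubelet crash.
	FreeDevices map[string][]string `json:"freeDevices"`
}

// save persists the checkpoint to a well-known path
// (error handling and atomic rename are omitted in this sketch).
func save(path string, cp Checkpoint) error {
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

func main() {
	cp := Checkpoint{
		Allocations: []DeviceAllocation{{
			ResourceName: "extensions.kubernetes.io/nvidia-gpu",
			DeviceID:     "GPU-fef8089b-4820-abfc-e83e-94318197576e",
			PodUID:       "1234-abcd",
			Container:    "cuda-job",
		}},
		FreeDevices: map[string][]string{
			"extensions.kubernetes.io/nvidia-gpu": {"GPU-0a1b2c3d"},
		},
	}
	// Hypothetical checkpoint location for illustration only.
	_ = save("/var/lib/kubelet/device-plugins/checkpoint.json", cp)
}
```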

## Device Plugin Lifecycle

![](device-plugin-overview.png)

1. The user or cluster admin pushes vendor-specific device plugin DaemonSets.
   The DaemonSet YAML config includes mountPaths to the host directories
   where the device driver, user-space libraries, and tools will be installed.
2. After the device plugin container is brought up, it detects the specific
   types of HW devices. If such devices exist, it initializes these devices
   and sets up the environment to access these devices (e.g., installs
   device drivers, user-space libraries, and tools).
3. After initialization, the device plugin queries HW device states through the
   installed device monitoring tools or other device interfaces. Then the device
   plugin connects to the Kubelet device plugin gRPC server and sends it the
   obtained list of HW device information. In the initial prototype, the
   device resource exported by a device plugin can be implemented as an
   [extended OIR](https://github.com/kubernetes/kubernetes/pull/48922)
   with the special prefix "extensions.kubernetes.io/", plus a device resource_name
   that uniquely identifies a device plugin on a node (a sketch follows this list).
   Kubelet can use the existing API to add this resource to the API server so that
   the device resource is available for scheduling.
4. The device plugin runs in a loop to continuously query HW device states. If it
   detects any changes, it sends the Kubelet device plugin gRPC server the new
   list of HW device information. Kubelet can use this information to update its
   device capacity states and, if necessary, re-allocate a new device to a user
   container with unhealthy allocated devices.
5. When Kubelet receives an allocation request for a HW device advertised
   by a device plugin (i.e., a resource with the "extensions.kubernetes.io/" prefix
   plus the device resource_name), it updates its internal allocation state,
   issues certain calls to CRI to bind mount the host directories where
   user-space libraries and tools are installed to the device-specific
   default directories in the user Pod or to set up certain environment variables,
   and checkpoints the container-to-device allocation information to persistent
   storage.
6. When the user container accessing the device finishes, Kubelet updates its
   internal state to deallocate the device, and updates its checkpoint state
   in persistent storage.
7. When the device plugin DaemonSet is removed, device state is cleaned up
   (e.g., uninstall the device driver, remove user-space libraries and tools).
   This step can be specified as a preStop
   [container lifecycle step](https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/).

Note that one implication of this approach is that the device plugin upgrade
process will be disruptive. It will need more thought if we want to
support a non-disruptive upgrade process.
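
As a small illustration of step 3, the sketch below derives the extended OIR
resource name from a plugin's registered resource name and counts healthy
devices toward node capacity. The helper names and the capacity representation
are assumptions for illustration, not part of the proposal.

```go
package main

import "fmt"

// DeviceState mirrors the state strings a plugin reports for each device.
type DeviceState string

const (
	Healthy   DeviceState = "HEALTHY"
	Unhealthy DeviceState = "UNHEALTHY"
)

// ReportedDevice is a simplified view of one DeviceInfo entry.
type ReportedDevice struct {
	ID    string
	State DeviceState
}

// extendedResourceName builds the OIR-style name described in step 3,
// e.g. "extensions.kubernetes.io/nvidia-gpu" for resource name "nvidia-gpu".
func extendedResourceName(resourceName string) string {
	return "extensions.kubernetes.io/" + resourceName
}

// capacityFor counts healthy devices so Kubelet could advertise them
// on the Node object through the existing capacity-update path.
func capacityFor(resourceName string, devices []ReportedDevice) (string, int) {
	healthy := 0
	for _, d := range devices {
		if d.State == Healthy {
			healthy++
		}
	}
	return extendedResourceName(resourceName), healthy
}

func main() {
	name, count := capacityFor("nvidia-gpu", []ReportedDevice{
		{ID: "GPU-fef8089b-4820-abfc-e83e-94318197576e", State: Healthy},
		{ID: "GPU-0a1b2c3d", State: Unhealthy},
	})
	fmt.Printf("%s: %d\n", name, count) // extensions.kubernetes.io/nvidia-gpu: 1
}
```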
## Protobuf API

```protobuf
syntax = "proto3";

service PluginResource {
    rpc Register(RegisterRequest) returns (RegisterResponse) {}
    rpc ReportDeviceInfo(ReportRequest) returns (ReportResponse) {}
}

message RegisterRequest {
    // Version of the API the Device Plugin was built against
    string version = 1;
    // E.g., "nvidia-gpu". Used to construct the OIR:
    // "extensions.kubernetes.io/resourcename".
    string resourcename = 2;
}

message RegisterResponse {
    bool success = 1;
    // Kubelet fills this field with details if it encounters any errors
    // during the registration process, e.g., for a version mismatch, what
    // is the required version and minimum supported version by Kubelet.
    string error = 2;
}

message ReportRequest {
    repeated DeviceInfo devices = 1;
}

message DeviceInfo {
    // E.g., "GPU-fef8089b-4820-abfc-e83e-94318197576e".
    // Needs to be unique per device plugin.
    string Id = 1;
    // E.g., UNKNOWN, HEALTHY, UNHEALTHY.
    enum State {
        UNKNOWN = 0;
        HEALTHY = 1;
        UNHEALTHY = 2;
    }
    State state = 2;
    // E.g., {"/rootfs/nvidia":"/usr/local/nvidia"}
    // Maps from the host directory where device libraries or tools
    // are installed to the user pod directory where the library or
    // tools are expected to be accessed. Kubelet will use this
    // information to bind mount the host directory to the user pod
    // directory during allocation.
    map<string, string> mountpaths = 3;
    // E.g., {"LD_PRELOAD":"xxx.so"}. Kubelet will export these
    // env variables in the user pod during allocation.
    map<string, string> envariables = 4;
    // E.g., {"Family":"Pascal"}, {"ECC":"True"}
    // These fields can be used as node labels for selection.
    map<string, string> labels = 5;
}

message ReportResponse {
    bool success = 1;
    // Kubelet fills this field if it encounters any errors
    // during the report process, e.g., the device plugin hasn't
    // registered yet (could happen when Kubelet restarts).
    string error = 2;
}
```
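
For illustration, a device plugin built against this API would act as a plain
gRPC client: register once, then report device info whenever states change. The
sketch below assumes Go stubs generated from the messages above into a
hypothetical `pluginapi` package and an assumed Kubelet Unix socket path; both
are placeholders, not defined by this proposal.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"

	// Hypothetical package generated by protoc from the spec above;
	// the import path is a placeholder.
	pluginapi "example.com/deviceplugin/v1alpha"
)

func main() {
	// Assumed location of the Kubelet gRPC server's Unix socket.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/device-plugins/kubelet.sock",
		grpc.WithInsecure())
	if err != nil {
		log.Fatalf("cannot reach Kubelet: %v", err)
	}
	defer conn.Close()
	client := pluginapi.NewPluginResourceClient(conn)

	// 1. Register the resource name once at startup.
	reg, err := client.Register(context.Background(), &pluginapi.RegisterRequest{
		Version:      "v1alpha1",
		Resourcename: "nvidia-gpu",
	})
	if err != nil || !reg.Success {
		log.Fatalf("registration failed: %v %v", err, reg.GetError())
	}

	// 2. Report discovered devices; repeat whenever device state changes.
	for {
		_, err := client.ReportDeviceInfo(context.Background(), &pluginapi.ReportRequest{
			Devices: []*pluginapi.DeviceInfo{{
				Id:         "GPU-fef8089b-4820-abfc-e83e-94318197576e",
				State:      pluginapi.DeviceInfo_HEALTHY,
				Mountpaths: map[string]string{"/rootfs/nvidia": "/usr/local/nvidia"},
			}},
		})
		if err != nil {
			// A failed report may mean Kubelet restarted; a real plugin
			// would re-register here.
			log.Printf("report failed, will retry: %v", err)
		}
		time.Sleep(30 * time.Second)
	}
}
```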
## Failure recovery

* Device failure: The device plugin should be able to detect device failure and
  report that to Kubelet. Kubelet should then remove the failed device from the
  available list. If there is any user container using the failed device,
  Kubelet may terminate the user container and reschedule it on a good
  available device. When a failed device recovers, the device plugin will send
  Kubelet the updated device state and Kubelet can add the device back to the
  available device list.
* Kubelet crash: When Kubelet restarts after a crash, it should be able to
  recover allocation states from the checkpoints recorded on persistent storage.
  The checkpoint records should include allocated device id to pod mapping
  information as well as non-allocated device information, so Kubelet can
  re-establish a precise allocation state. The device plugin should be able to
  detect Kubelet failure when it needs to update device information,
  and re-register with the new Kubelet.
* Device plugin crash: A device plugin is deployed through DaemonSets.
  If a device plugin process fails, Kubelet will detect that and automatically
  restart it. After restart, the device plugin will reconnect to Kubelet and
  report the current device states. Kubelet can compare the reported device
  information with its internal device states, and make adjustments if
  necessary. One thing we need to pay special attention to is that the device
  plugin may fail at any time, e.g., during initialization. When the new device
  plugin process starts, it needs to be able to recover from incomplete states.
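
A minimal sketch of the comparison Kubelet could run when a restarted device
plugin reports its current devices; the map-based state and the function below
are illustrative assumptions, not the actual Kubelet implementation.

```go
package main

import "fmt"

// reconcile compares Kubelet's in-memory device view (keyed by device ID,
// true = healthy) against the list a restarted plugin just reported.
func reconcile(known map[string]bool, reported map[string]bool) {
	// Devices the plugin no longer reports are treated as gone.
	for id := range known {
		if _, ok := reported[id]; !ok {
			delete(known, id)
			fmt.Printf("device %s disappeared, removing from available list\n", id)
		}
	}
	// New or changed devices overwrite Kubelet's view.
	for id, healthy := range reported {
		prev, existed := known[id]
		if !existed || prev != healthy {
			known[id] = healthy
			fmt.Printf("device %s set to healthy=%v\n", id, healthy)
		}
	}
}

func main() {
	known := map[string]bool{"GPU-0": true, "GPU-1": true}
	reported := map[string]bool{"GPU-0": true, "GPU-1": false}
	reconcile(known, reported)
}
```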

## Roadmap

* Phase 1: device plugin is supported in alpha mode in the 1.8 Kubernetes release.
  Make sure it provides the following functionalities: initialize, discover,
  allocate/deallocate, cleanup, basic health check, and can recover from device,
  Kubelet, or device plugin failures. Make sure the interface is kept simple and
  extensible, and the documentation is clear. With the initially implemented API,
  make sure we can use the interface to implement device plugin images for at
  least two types of devices: Nvidia GPU and Solarflare NIC. Note the support
  for Nvidia GPU will help GPU support enter beta by providing a general
  and extensible API. Test coverage: e2e tests with the developed Nvidia GPU
  image and Solarflare image to make sure these devices can be correctly
  initialized, allocated, deallocated, and cleaned up. We should also test that
  we can recover from device failure, Kubelet restarts, and device plugin failure.
* Phase 2: device plugin is supported in beta mode in the 1.9 Kubernetes release.
  At this phase, the primary design and API should be stabilized. We need to
  implement an authentication mechanism to ensure only trusted device plugin
  images can be registered. We can support device specific
  allocation/deallocation requests by having the device plugin also provide a gRPC
  service, or the Device Plugin can instruct Kubelet to perform device specific
  operation hooks during allocation/deallocation procedures.
  Hopefully at this time, we may make good progress on supporting more flexible
  resource allocation policies in Kubernetes, and with that, we can switch
  the device plugin from using OIR to using ResourceClass to allow more efficient
  HW specific resource allocations, e.g., topology aware resource allocations,
  NUMA aware resource allocations, etc.
* Phase 3: device plugin is supported in GA mode in the 1.10 Kubernetes release.
  Device plugin should have clear error handling and problem reporting that
  allows easy debuggability and monitoring of its exported devices.
  We should have clear documentation on how to develop a device plugin and
  how to interact with a device plugin. The framework needs to be stable and
  demonstrate good user experience through the support of multiple types
  of devices, such as GPU, InfiniBand, high-performance NIC, etc.

## Open Questions

* The proposal assumes we can omit device specific allocation/deallocation
  operations in the alpha release and support this feature in later releases.
  If people are concerned that such an omission would impact the usability of
  the alpha release, we will need to come up with a solution that would either
  require two-way gRPC communication between Kubelet and the Device Plugin, or
  the Device Plugin can instruct Kubelet to perform device specific operation
  hooks during allocation/deallocation procedures.

# Installation

The installation process should be straightforward to the user, transparent
and similar to other regular Kubernetes actions.

@@ -366,6 +573,7 @@ as `kubeadm` they would use the examples that we would provide at:

`https://github.com/Kubernetes/Kubernetes/tree/master/examples/device-plugin.yaml`

YAML example:

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
@@ -388,48 +596,6 @@ spec:
        path: /var/run/Kubernetes
```

# API Changes

## Device

When discovering the devices, Kubelet will be in charge of advertising those
resources to the API server.

We will advertise each device returned by the Device Plugin in a new structure
called `Device`.
It is defined as follows:

```golang
type Device struct {
    Kind       string
    Vendor     string
    Name       string
    Health     DeviceHealthStatus
    Properties map[string]string
}
```

Because the current API (Capacity) cannot be extended to support Device,
we will need to create two new attributes in the NodeStatus structure:
* `DevCapacity`: Describing the device capacity of the node
* `DevAvailable`: Describing the available devices

```golang
type NodeStatus struct {
    DevCapacity  []Device
    DevAvailable []Device
}
```

We also introduce a `Devices` field in the container's status so that the user
can know what devices were assigned to the pod. It could also be useful in
the case of monitoring.

```golang
type ContainerStatus struct {
    Devices []Device
}
```

# Versioning

Currently there is only one part (CRI) of Kubernetes which is based on

@@ -469,3 +635,12 @@ Negotiation would take place in the registration:

   contacts the Device Plugin
4. If the Device Plugin supports the version sent by Kubelet it can and should
   answer the different calls made by Kubelet

# References

* [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
* [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
* [Kubernetes Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
* [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)