mirror of https://github.com/kubernetes/kops.git
Add docs on gpu
parent deeda7137d
commit 1ceb35ad05
104
docs/gpu.md
@@ -1,8 +1,32 @@
# GPU Support

You can use [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) to install NVIDIA device drivers and tools on your cluster.
## kOps managed device driver

## Creating a cluster with GPU nodes
{{ kops_feature_table(kops_added_default='1.22') }}

kOps can install the NVIDIA device drivers, device plugin, and container runtime, and configure containerd to use that runtime.

kOps will also install a RuntimeClass named `nvidia`. Because the NVIDIA runtime is not the default runtime, you need to add `runtimeClassName: nvidia` to the spec of any Pod that runs GPU workloads. The RuntimeClass also configures the node selectors and tolerations required to run on GPU nodes.
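
As a minimal sketch of a workload that uses the RuntimeClass, a Pod could look roughly like the following. The Pod name and image are hypothetical placeholders, and the `nvidia.com/gpu` resource limit assumes the device plugin advertises GPUs under that resource name.

```yaml
# Sketch only: the Pod name and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                 # hypothetical name
spec:
  runtimeClassName: nvidia          # the RuntimeClass installed by kOps
  restartPolicy: Never
  containers:
  - name: cuda
    # Placeholder image; substitute your own CUDA-capable image.
    image: registry.example.com/cuda-app:latest
    # Assumes the NVIDIA runtime makes nvidia-smi available in the container.
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1           # assumed device plugin resource name
```

No explicit node selector or toleration appears here because, as described above, the RuntimeClass is expected to supply them.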

kOps will add the `kops.k8s.io/gpu="1"` label to GPU nodes for use as a node selector, as well as the following taint:

```yaml
taints:
- effect: NoSchedule
  key: nvidia.com/gpu
```

The taint will prevent you from accidentally scheduling workloads on GPU nodes.
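
To illustrate how a RuntimeClass can tie this label and taint together, here is a rough sketch. It is not necessarily the manifest kOps installs; in particular, the `handler` value is an assumption.

```yaml
# Illustrative sketch, not the manifest kOps ships.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia                     # assumed containerd handler name
scheduling:
  nodeSelector:
    kops.k8s.io/gpu: "1"            # restricts Pods to labelled GPU nodes
  tolerations:
  - key: nvidia.com/gpu             # tolerates the GPU taint shown above
    operator: Exists
    effect: NoSchedule
```

With a `scheduling` section like this, Kubernetes merges the node selector and tolerations into every Pod that references the RuntimeClass, which is why such Pods need no scheduling fields of their own.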

You can enable NVIDIA GPU support by adding the following to your Cluster spec:

```yaml
containerd:
  nvidiaGPU:
    enabled: true
```
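
For context, the fragment above would sit under `spec` in the Cluster object. The sketch below assumes the `kops.k8s.io/v1alpha2` API version and uses a hypothetical cluster name; all other spec fields are omitted.

```yaml
# Sketch only: hypothetical cluster name, unrelated spec fields omitted.
apiVersion: kops.k8s.io/v1alpha2    # assumed API version
kind: Cluster
metadata:
  name: gpu.example.com             # hypothetical cluster name
spec:
  containerd:
    nvidiaGPU:
      enabled: true                 # the setting described above
```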

## Creating an instance group with GPU nodes

Because GPU instances are expensive, you want to minimize the number of pods running on them. Therefore, start by provisioning a regular cluster following the [getting started documentation](https://kops.sigs.k8s.io/getting_started/aws/).
@@ -25,78 +49,4 @@ spec:
  role: Node
  subnets:
  - eu-central-1c
  taints:
  - nvidia.com/gpu=present:NoSchedule
```

Note the taint used above. It prevents pods from being scheduled on GPU nodes unless they explicitly tolerate it. The GPU Operator resources tolerate this taint by default.
Also note the node label we set. It is used to ensure that the GPU Operator resources run on GPU nodes.

## Install GPU Operator
GPU Operator is installed using `helm`. See the [general install instructions for GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-gpu-operator).

To match the kOps environment, create a `values.yaml` file with the following content:

```yaml
operator:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

driver:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

toolkit:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

devicePlugin:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

dcgmExporter:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

gfd:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

node-feature-discovery:
  worker:
    nodeSelector:
      kops.k8s.io/instancegroup: gpu-nodes
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
```

Once you have installed the _helm chart_, you should be able to see the GPU Operator resources being spawned in the `gpu-operator-resources` namespace.

You should now be able to schedule other workloads on the GPU nodes by adding the following properties to the pod spec:
```yaml
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
```
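
Put together, a complete Pod using these properties might look like the sketch below. The Pod name and image are hypothetical placeholders, and the `nvidia.com/gpu` limit assumes the GPU Operator's device plugin exposes GPUs under that resource name.

```yaml
# Sketch only: the Pod name and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-example        # hypothetical name
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
  restartPolicy: Never
  containers:
  - name: cuda
    # Placeholder image; substitute your own CUDA-capable image.
    image: registry.example.com/cuda-app:latest
    resources:
      limits:
        nvidia.com/gpu: 1           # assumed device plugin resource name
```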
```
@@ -34,6 +34,9 @@ spec:

Currently this is only available using the AWS cloud provider.

## Managed nvidia instances

kOps can now provision instances with NVIDIA GPUs and configure them for container workloads without the need for hooks and operators. See [GPU support](https://kops.sigs.k8s.io/gpu/).

## Other significant changes