mirror of https://github.com/kubernetes/kops.git
Add some quick notes on how to get GPU Operator working
parent 18ffb493bf
commit c7a2183a1d
docs/gpu.md (80 lines changed)

# GPU Support
You can use [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) to install NVIDIA device drivers and tools to your cluster.
## Creating a cluster with GPU nodes
Due to the cost of GPU instances, you want to minimize the number of pods running on them. Therefore, start by provisioning a regular cluster following the [getting started documentation](https://kops.sigs.k8s.io/getting_started/aws/).
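
If you do not yet have a cluster, that step is covered in the linked guide; as a rough sketch, where the state store, cluster name, and zones are placeholders rather than values from this document:

```bash
# Placeholders only: substitute your own state store, cluster name, and zones.
export KOPS_STATE_STORE=s3://<your state store>
kops create cluster --name <cluster name> --zones <zones> --yes
```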
Once the cluster is running, add an instance group with GPUs:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <cluster name>
  name: gpu-nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-central-1c
  taints:
  - nvidia.com/gpu=present:NoSchedule
```
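
For example, assuming the manifest above is saved as `gpu-nodes.yaml` (the file name is only illustrative), it can be registered and rolled out with the usual kops workflow:

```bash
# Create the instance group from the manifest, then apply the change to the cluster.
kops create -f gpu-nodes.yaml
kops update cluster --name <cluster name> --yes
```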
Note the taint used above. It prevents pods from being scheduled on GPU nodes unless they explicitly tolerate it. The GPU Operator resources tolerate this taint by default.

Also note the node label we set. It will be used to ensure that the GPU Operator resources run on GPU nodes.
## Install GPU Operator
GPU Operator is installed using `helm`. See the [general install instructions for GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-gpu-operator).

To match the _kops_ environment, create a `values.yaml` file with the following content:

```yaml
operator:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

driver:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

toolkit:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

devicePlugin:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

dcgmExporter:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

gfd:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

node-feature-discovery:
  worker:
    nodeSelector:
      kops.k8s.io/instancegroup: gpu-nodes
```
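
With `values.yaml` in place, the install itself follows the NVIDIA instructions linked above. At the time of writing it looked roughly like the following; the repository URL, chart name, and release name come from NVIDIA's documentation rather than this guide, so verify them there:

```bash
# Add NVIDIA's helm repository and install the operator with the kops-specific values.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator -f values.yaml
```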
Once you have installed the _helm chart_, you should be able to see the GPU Operator resources being spawned in the `gpu-operator-resources` namespace.
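
A quick way to check is to list the pods in that namespace:

```bash
kubectl get pods -n gpu-operator-resources
```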
You should now be able to schedule your own GPU workloads by adding the following properties to the pod spec:

```yaml
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
```
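
For reference, a minimal test pod that puts these pieces together might look like the following; the container image, pod name, and the `nvidia.com/gpu` resource request are illustrative additions rather than something this guide prescribes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # any CUDA-capable image will do
    command: ["nvidia-smi"]        # prints the GPUs visible to the container
    resources:
      limits:
        nvidia.com/gpu: 1          # request a single GPU from the device plugin
```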