mirror of https://github.com/kubernetes/kops.git
Add some quick notes on how to get GPU operator working
parent
18ffb493bf
commit
c7a2183a1d
80
docs/gpu.md
@@ -1,5 +1,81 @@
# GPU Support

You can use [kops hooks](./cluster_spec.md#hooks) to install [Nvidia kubernetes device plugin](https://github.com/NVIDIA/k8s-device-plugin) and enable GPU support in the cluster.

You can use [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) to install NVIDIA device drivers and tools on your cluster.

See instructions in [kops hooks for nvidia-device-plugin](../hooks/nvidia-device-plugin).

## Creating a cluster with GPU nodes

Due to the cost of GPU instances, you want to minimize the number of pods running on them. Therefore, start by provisioning a regular cluster following the [getting started documentation](https://kops.sigs.k8s.io/getting_started/aws/).

Once the cluster is running, add an instance group with GPUs:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <cluster name>
  name: gpu-nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-central-1c
  taints:
  - nvidia.com/gpu=present:NoSchedule
```

Note the taint used above. It prevents pods from being scheduled on GPU nodes unless they explicitly tolerate it. The GPU Operator resources tolerate this taint by default.

Also note the node label we set. It is used to ensure that the GPU Operator resources run on GPU nodes.

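The taint string above follows the Kubernetes `key=value:effect` form. As a rough illustration of how it maps onto a pod spec, the fragment below is a minimal sketch of a toleration that matches exactly this taint (key `nvidia.com/gpu`, value `present`, effect `NoSchedule`). It assumes you want the strictest match; a looser toleration with `operator: Exists` would also work.

```yaml
# Sketch: toleration matching the taint nvidia.com/gpu=present:NoSchedule exactly.
tolerations:
- key: nvidia.com/gpu     # taint key
  operator: Equal         # require the value below to match as well
  value: present          # taint value
  effect: NoSchedule      # taint effect
```

A pod without a matching toleration is not scheduled onto the `gpu-nodes` instance group, which keeps ordinary workloads off the GPU instances.
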
## Install GPU Operator

GPU Operator is installed using `helm`. See the [general install instructions for GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-gpu-operator).

To match the _kops_ environment, create a `values.yaml` file with the following content:

```yaml
operator:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

driver:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

toolkit:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

devicePlugin:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

dcgmExporter:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

gfd:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

node-feature-discovery:
  worker:
    nodeSelector:
      kops.k8s.io/instancegroup: gpu-nodes
```

Once you have installed the _helm chart_, you should be able to see the GPU Operator resources being spawned in the `gpu-operator-resources` namespace.

You should now be able to schedule other workloads on the GPU nodes by adding the following properties to the pod spec:

```yaml
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
```

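To show where those properties sit in a full manifest, here is a minimal sketch of a one-off pod that requests a GPU and runs `nvidia-smi`. The pod name, container name, and image tag are assumptions made for illustration and are not part of the original notes. The `nvidia.com/gpu` resource limit is what asks the scheduler for a GPU exposed by the NVIDIA device plugin.

```yaml
# Sketch of a GPU smoke-test pod; names and image tag are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical pod name
spec:
  restartPolicy: Never            # run once and exit
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
  containers:
  - name: cuda                    # hypothetical container name
    image: nvidia/cuda:11.0.3-base-ubuntu20.04   # assumed tag; any CUDA base image should do
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1         # request one GPU from the device plugin
```

If the operator is working, applying this manifest with `kubectl apply -f` and then reading the pod's logs with `kubectl logs gpu-smoke-test` should print the `nvidia-smi` table for the node's GPU.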