# GPU Support
You can use [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) to install NVIDIA device drivers and tools on your cluster.
## Creating a cluster with GPU nodes
Due to the cost of GPU instances, you want to minimize the number of pods running on them. Therefore, start by provisioning a regular cluster following the [getting started documentation](https://kops.sigs.k8s.io/getting_started/aws/).
Once the cluster is running, add an instance group with GPUs:
```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <cluster name>
  name: gpu-nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-central-1c
  taints:
  - nvidia.com/gpu=present:NoSchedule
```
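With the manifest saved, create the instance group and apply the change. A minimal sketch, assuming the manifest was saved as `gpu-nodes.yaml` (a hypothetical file name); substitute your own cluster name:

```sh
# Create the instance group from the manifest, then roll out the change.
kops create -f gpu-nodes.yaml
kops update cluster <cluster name> --yes
```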
Note the taint used above. It prevents pods from being scheduled on GPU nodes unless they explicitly tolerate the taint. The GPU Operator resources tolerate it by default.

Also note the node label we set. It is used to ensure that the GPU Operator resources run on GPU nodes.
## Install GPU Operator
GPU Operator is installed using `helm`. See the [general install instructions for GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-gpu-operator).
In order to match the kOps environment, create a `values.yaml` file with the following content:
```yaml
operator:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

driver:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

toolkit:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

devicePlugin:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

dcgmExporter:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

gfd:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

node-feature-discovery:
  worker:
    nodeSelector:
      kops.k8s.io/instancegroup: gpu-nodes
```
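With the values file in place, installation follows the NVIDIA instructions linked above. A sketch, assuming the chart is published as `nvidia/gpu-operator` in NVIDIA's Helm repository and that the release name is yours to choose:

```sh
# Add NVIDIA's Helm repository and install GPU Operator with the
# values file above (see the linked docs for current chart details).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -f values.yaml
```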
Once you have installed the _helm chart_, you should see the GPU Operator resources being spawned in the `gpu-operator-resources` namespace.
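A quick way to verify this is a plain `kubectl` listing:

```sh
# All GPU Operator pods should eventually reach Running or Completed.
kubectl get pods -n gpu-operator-resources
```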
You should now be able to schedule workloads on the GPU nodes by adding the following properties to the pod spec:
```yaml
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
```
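For example, a complete pod that runs `nvidia-smi` on a GPU node might look like the following sketch; the pod name, image tag, and GPU resource request are illustrative, not part of this guide:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test            # hypothetical name
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # illustrative CUDA base image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # request one GPU via the device plugin
```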