# GPU Support
## kOps managed device driver
{{ kops_feature_table(kops_added_default='1.22') }}
kOps can install the NVIDIA device drivers, device plugin, and container runtime, and configure containerd to use that runtime.
kOps will also install a RuntimeClass named `nvidia`. Because the NVIDIA runtime is not the default runtime, you need to add `runtimeClassName: nvidia` to the spec of any Pod that runs GPU workloads. The RuntimeClass also configures the appropriate node selectors and tolerations to run on GPU Nodes.
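As an illustration of the `runtimeClassName` setting described above, here is a minimal sketch of a Pod that asks for one GPU. The Pod name and the container image are assumptions chosen for the example (substitute any image and tag available to you), and the `nvidia.com/gpu` limit assumes the device plugin advertises GPUs under that resource name, as shown in the verification section of this page.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test              # hypothetical name
spec:
  runtimeClassName: nvidia          # use the RuntimeClass installed by kOps
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubuntu20.04   # assumption: replace with an image/tag you have access to
    command: ["nvidia-smi"]         # lists the GPUs visible inside the container
    resources:
      limits:
        nvidia.com/gpu: 1           # request one GPU from the device plugin
```

No explicit node selector or toleration is set here, because the RuntimeClass is expected to supply them.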
kOps adds the label `kops.k8s.io/gpu="1"` to GPU nodes for use as a node selector, as well as the following taint:
```yaml
taints:
- effect: NoSchedule
  key: nvidia.com/gpu
```
The taint prevents you from accidentally scheduling workloads on GPU Nodes.
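A Pod that does not use the `nvidia` RuntimeClass but still has to run on GPU nodes (a node-level agent, for instance) would need to set the selector and toleration itself. The following is a sketch under that assumption; the Pod name and image are hypothetical, and only the label and taint key come from the text above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-node-agent              # hypothetical name
spec:
  nodeSelector:
    kops.k8s.io/gpu: "1"            # label kOps adds to GPU nodes
  tolerations:
  - key: nvidia.com/gpu             # matches the taint shown above
    operator: Exists
    effect: NoSchedule
  containers:
  - name: agent
    image: registry.example.com/agent:latest   # hypothetical image
```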
You can enable NVIDIA GPU support by adding the following to your Cluster spec:
```yaml
containerd:
  nvidiaGPU:
    enabled: true
```
## Creating an instance group with GPU nodes
Due to the cost of GPU instances, you want to minimize the number of pods running on them. Therefore, start by provisioning a regular cluster following the [getting started documentation](https://kops.sigs.k8s.io/getting_started/aws/).
Once the cluster is running, add an instance group with GPUs:
```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <cluster name>
  name: gpu-nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-central-1c
```
## GPUs in OpenStack
On OpenStack, the containerd GPU configuration cannot be enabled at the cluster level. It must be set on the instance group instead:
```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <cluster name>
  name: gpu-nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-central-1c
  containerd:
    nvidiaGPU:
      enabled: true
```
## Verifying GPUs
1. After the new GPU nodes come up, they should appear in the output of `kubectl get nodes`.
2. The nodes should have the `kops.k8s.io/gpu` label and the `nvidia.com/gpu:NoSchedule` taint.
3. The `kube-system` namespace should have an `nvidia-device-plugin-daemonset` pod running on the GPU node(s).
4. If `nvidia.com/gpu` appears in the output of `kubectl describe node <node>`, everything should work.
```
Capacity:
  cpu:                4
  ephemeral-storage:  9983232Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32796292Ki
  nvidia.com/gpu:     1 <- this one
  pods:               110
```