# GPU Support

## kOps managed device driver

{{ kops_feature_table(kops_added_default='1.22') }}

kOps can install nvidia device drivers, plugin, and runtime, as well as configure containerd to make use of the runtime.

kOps will also install a RuntimeClass `nvidia`. As the nvidia runtime is not the default runtime, you will need to add `runtimeClassName: nvidia` to the spec of any Pod that should run GPU workloads. The RuntimeClass also configures the appropriate node selectors and tolerations to run on GPU Nodes.

kOps will add `kops.k8s.io/gpu="1"` as a node selector, as well as the following taint:

```yaml
taints:
- effect: NoSchedule
  key: nvidia.com/gpu
```
The taint will prevent you from accidentally scheduling workloads on GPU Nodes.

You can enable nvidia by adding the following to your Cluster spec:

```yaml
containerd:
  nvidiaGPU:
    enabled: true
```
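Because `nvidia` is not the default runtime, a Pod that should use the GPUs has to opt in via `runtimeClassName` and request the `nvidia.com/gpu` resource exposed by the device plugin. A minimal sketch (the image name and command are illustrative, not part of this guide):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  runtimeClassName: nvidia         # use the RuntimeClass installed by kOps
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # illustrative image; any CUDA-capable image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # request one GPU from the device plugin
```

Requesting `nvidia.com/gpu` ensures the scheduler only places the Pod on a node that advertises GPUs, while the RuntimeClass supplies the node selector and toleration described above.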
## Creating an instance group with GPU nodes

Due to the cost of GPU instances, you will want to minimize the number of pods running on them. Therefore, start by provisioning a regular cluster following the [getting started documentation](https://kops.sigs.k8s.io/getting_started/aws/).

Once the cluster is running, add an instance group with GPUs:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <cluster name>
  name: gpu-nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-central-1c
```
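Assuming the manifest above is saved as `gpu-nodes.yaml` (a hypothetical filename), a sketch of the usual kops workflow to apply it is:

```shell
# create the instance group from the manifest, then roll out the change
kops create -f gpu-nodes.yaml
kops update cluster --name <cluster name> --yes
```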
## GPUs in OpenStack

OpenStack does not support enabling the containerd configuration at the cluster level. It needs to be done in the instance group:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <cluster name>
  name: gpu-nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-central-1c
  containerd:
    nvidiaGPU:
      enabled: true
```
## Verifying GPUs

1. After the new GPU nodes come up, you should see them in `kubectl get nodes`.
2. The nodes should have the `kops.k8s.io/gpu` label and the `nvidia.com/gpu:NoSchedule` taint.
3. The `kube-system` namespace should have an nvidia-device-plugin-daemonset pod provisioned on the GPU node(s).
4. If you see `nvidia.com/gpu` in `kubectl describe node <node>`, everything should work:

```
Capacity:
  cpu:                4
  ephemeral-storage:  9983232Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32796292Ki
  nvidia.com/gpu:     1   <- this one
  pods:               110
```
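The checks above can be run from the command line; for example (`<node>` is a placeholder for one of your GPU node names):

```shell
# 1-2: list nodes carrying the GPU label (taints show in the describe output)
kubectl get nodes --show-labels | grep kops.k8s.io/gpu

# 3: confirm the device plugin daemonset pod is running
kubectl get pods -n kube-system | grep nvidia-device-plugin

# 4: confirm the node advertises the GPU resource
kubectl describe node <node> | grep nvidia.com/gpu
```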