mirror of https://github.com/kubernetes/kops.git
Add some quick notes on how to get GPU Operator working
parent 18ffb493bf
commit c7a2183a1d
docs/gpu.md (80 lines changed)

# GPU Support
You can use [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) to install NVIDIA device drivers and tools to your cluster.
## Creating a cluster with GPU nodes
Due to the cost of GPU instances, you want to minimize the number of pods running on them. Therefore, start by provisioning a regular cluster following the [getting started documentation](https://kops.sigs.k8s.io/getting_started/aws/).
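
If you do not yet have a cluster, that step is covered in the linked guide; as a rough sketch, where the state store, cluster name, and zones are placeholders rather than values from this document:

```bash
# Placeholders only: substitute your own state store, cluster name, and zones.
export KOPS_STATE_STORE=s3://<your state store>
kops create cluster --name <cluster name> --zones <zones> --yes
```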
Once the cluster is running, add an instance group with GPUs:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <cluster name>
  name: gpu-nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - eu-central-1c
  taints:
  - nvidia.com/gpu=present:NoSchedule
```
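
For example, assuming the manifest above is saved as `gpu-nodes.yaml` (the file name is only illustrative), it can be registered and rolled out with the usual kops workflow:

```bash
# Create the instance group from the manifest, then apply the change to the cluster.
kops create -f gpu-nodes.yaml
kops update cluster --name <cluster name> --yes
```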
Note the taint used above. It prevents pods from being scheduled on GPU nodes unless they explicitly tolerate it. The GPU Operator resources tolerate this taint by default.

Also note the node label we set. It will be used to ensure that the GPU Operator resources run on GPU nodes.
## Install GPU Operator
GPU Operator is installed using `helm`. See the [general install instructions for GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-gpu-operator).

To match the _kops_ environment, create a `values.yaml` file with the following content:

```yaml
operator:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

driver:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

toolkit:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

devicePlugin:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

dcgmExporter:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

gfd:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes

node-feature-discovery:
  worker:
    nodeSelector:
      kops.k8s.io/instancegroup: gpu-nodes
```
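
With `values.yaml` in place, the install itself follows the NVIDIA instructions linked above. At the time of writing it looked roughly like the following; the repository URL, chart name, and release name come from NVIDIA's documentation rather than this guide, so verify them there:

```bash
# Add NVIDIA's helm repository and install the operator with the kops-specific values.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator -f values.yaml
```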
Once you have installed the _helm chart_, you should be able to see the GPU Operator resources being spawned in the `gpu-operator-resources` namespace.
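
A quick way to check is to list the pods in that namespace:

```bash
kubectl get pods -n gpu-operator-resources
```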
You should now be able to schedule your own GPU workloads by adding the following properties to the pod spec:

```yaml
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
```
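
For reference, a minimal test pod that puts these pieces together might look like the following; the container image, pod name, and the `nvidia.com/gpu` resource request are illustrative additions rather than something this guide prescribes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # any CUDA-capable image will do
    command: ["nvidia-smi"]        # prints the GPUs visible to the container
    resources:
      limits:
        nvidia.com/gpu: 1          # request a single GPU from the device plugin
```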