mirror of https://github.com/kubernetes/kops.git
Add docs on gpu
This commit is contained in:
parent deeda7137d
commit 1ceb35ad05

docs/gpu.md | 102

@@ -1,8 +1,32 @@

# GPU Support

You can use [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) to install NVIDIA device drivers and tools to your cluster.

## kOps managed device driver

## Creating a cluster with GPU nodes

{{ kops_feature_table(kops_added_default='1.22') }}

kOps can install nvidia device drivers, plugin, and runtime, as well as configure containerd to make use of the runtime.

kOps will also install a RuntimeClass `nvidia`. As the nvidia runtime is not the default runtime, you will need to add `runtimeClassName: nvidia` to the spec of any Pod that should run GPU workloads. The RuntimeClass also configures the appropriate node selectors and tolerations to run on GPU nodes.
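
As a sketch, a Pod for a GPU workload could then look like the following (the Pod name and image are illustrative; the `nvidia.com/gpu` resource name assumes the installed device plugin advertises GPUs under that name):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                   # illustrative name
spec:
  runtimeClassName: nvidia         # use the nvidia runtime installed by kOps
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # any CUDA-capable image works; this tag is only an example
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # request a single GPU
```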

kOps will add `kops.k8s.io/gpu="1"` as a node selector, as well as the following taint:

```yaml
taints:
- effect: NoSchedule
  key: nvidia.com/gpu
```

The taint will prevent you from accidentally scheduling workloads on GPU nodes.
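
This node selector and taint are also what the `nvidia` RuntimeClass mentioned above matches and tolerates. As a rough sketch (the `handler` value is an assumption here, and the manifest kOps actually installs may differ in detail), such a RuntimeClass can look like this:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia              # containerd runtime handler; assumed name
scheduling:
  nodeSelector:
    kops.k8s.io/gpu: "1"     # only schedule on GPU nodes
  tolerations:
  - key: nvidia.com/gpu      # tolerate the GPU taint
    operator: Exists
    effect: NoSchedule
```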

You can enable nvidia by adding the following to your Cluster spec:

```yaml
containerd:
  nvidiaGPU:
    enabled: true
```
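
This snippet goes under `spec` in the Cluster object. As a minimal sketch (the cluster name is a placeholder), the relevant part of the manifest edited via `kops edit cluster` would look like:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: mycluster.example.com   # placeholder cluster name
spec:
  containerd:
    nvidiaGPU:
      enabled: true
```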

## Creating an instance group with GPU nodes

Due to the cost of GPU instances you want to minimize the number of pods running on them. Therefore, start by provisioning a regular cluster following the [getting started documentation](https://kops.sigs.k8s.io/getting_started/aws/).
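
After the cluster is running, add a dedicated instance group for the GPU nodes. A minimal sketch of such an InstanceGroup manifest is shown below (the cluster name, machine type, and subnet are placeholders used for illustration):

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: mycluster.example.com   # placeholder cluster name
  name: gpu-nodes
spec:
  machineType: g4dn.xlarge          # placeholder GPU instance type
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes   # node label referenced by the GPU Operator values below
  role: Node
  subnets:
  - eu-central-1c                   # placeholder subnet
  taints:
  - nvidia.com/gpu=present:NoSchedule
```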

@@ -25,78 +49,4 @@ spec:

```yaml
  role: Node
  subnets:
  - eu-central-1c
  taints:
  - nvidia.com/gpu=present:NoSchedule
```

Note the taint used above. This will prevent pods from being scheduled on GPU nodes unless we explicitly want them to be. The GPU Operator resources tolerate this taint by default.

Also note the node label we set. This will be used to ensure the GPU Operator resources run on GPU nodes.

## Install GPU Operator

GPU Operator is installed using `helm`. See the [general install instructions for GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-gpu-operator).

In order to match the kOps environment, create a `values.yaml` file with the following content:

```yaml
operator:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

driver:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

toolkit:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

devicePlugin:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

dcgmExporter:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

gfd:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists

node-feature-discovery:
  worker:
    nodeSelector:
      kops.k8s.io/instancegroup: gpu-nodes
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
```

Once you have installed the _helm chart_, you should be able to see the GPU Operator resources being spawned in the `gpu-operator-resources` namespace.

You should now be able to schedule your own workloads on the GPU nodes by adding the following properties to the Pod spec:

```yaml
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
```
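
Put together, a complete Pod for this setup could look like the following sketch (the Pod name and image are illustrative; the `nvidia.com/gpu` resource assumes GPU Operator's device plugin advertises GPUs under that name):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test          # illustrative name
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpu-nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # any CUDA-capable image works; this tag is only an example
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # request a single GPU
```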

@@ -34,6 +34,9 @@ spec:

Currently this is only available using the AWS cloud provider.

## Managed nvidia instances

kOps can now provision instances with nvidia GPUs and configure them for container workloads without the need for hooks or operators. See [GPU support](https://kops.sigs.k8s.io/gpu/).

## Other significant changes