GPU with Kubeflow Pipeline Standalone (#3484)

* GPU with Kubeflow Pipeline Standalone

* done

* don't check in compiled pipeline

* gpu tpu preemptible

* done

* scope and quota comment
Renmin 2020-04-14 17:31:11 +08:00 committed by GitHub
parent 8d9a04a8ca
commit 5882e52acc
2 changed files with 307 additions and 0 deletions


@@ -0,0 +1,6 @@
# GPU
This folder contains a GPU sample.
- Demonstrates how to set up a single GPU node pool at low cost via autoscaling.
- Demonstrates how to set up more than one GPU node pool in one cluster.
- Demonstrates how to consume GPUs via the Kubeflow Pipelines SDK.


@@ -0,0 +1,301 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preparation\n",
"\n",
"If you installed Kubeflow via [kfctl](https://www.kubeflow.org/docs/gke/customizing-gke/#common-customizations), you may already prepared GPU enviroment and can skip this section.\n",
"\n",
"If you installed Kubeflow Pipelines via [Google Cloud AI Platform Pipelines UI](https://console.cloud.google.com/ai-platform/pipelines/) or [Standalone manifest](https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize), please follow following steps to setup GPU enviroment.\n",
"\n",
"## Add GPU nodes to your cluster\n",
"\n",
"To see which accelerators are available in each zone, run the following command or check the [document](https://cloud.google.com/compute/docs/gpus#gpus-list)\n",
"\n",
"```\n",
"gcloud compute accelerator-types list\n",
"```\n",
"\n",
"You may also check or edit the GCP's **GPU Quota** to make sure you still have GPU quota in the region.\n",
"\n",
"To well saving the costs, it's possible you create a zero-sized node pool for GPU and enable the autoscaling.\n",
"\n",
"Here is an example to create a P100 GPU node pool for a cluster.\n",
"\n",
"```shell\n",
"# You may customize these parameters.\n",
"export GPU_POOL_NAME=p100pool\n",
"export CLUSTER_NAME=existingClusterName\n",
"export CLUSTER_ZONE=us-west1-a\n",
"export GPU_TYPE=nvidia-tesla-p100\n",
"export GPU_COUNT=1\n",
"export MACHINE_TYPE=n1-highmem-16\n",
"\n",
"\n",
"# It may takes several minutes.\n",
"gcloud container node-pools create ${GPU_POOL_NAME} \\\n",
" --accelerator type=${GPU_TYPE},count=${GPU_COUNT} \\\n",
" --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} \\\n",
" --num-nodes=0 --machine-type=${MACHINE_TYPE} --min-nodes=0 --max-nodes=5 --enable-autoscaling \\\n",
" --scopes=cloud-platform\n",
"```\n",
"\n",
"Here in this sample, we specified **--scopes=cloud-platform**. More info is [here](https://cloud.google.com/sdk/gcloud/reference/container/node-pools/create#--scopes). It will allow the job in the node pool to use GCE Default Service Account to access GCP APIs (e.x. GCS etc.). You also use [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) or [Application Default Credential](https://cloud.google.com/docs/authentication/production) to replace **--scopes=cloud-platform**.\n",
"\n",
"## Install device driver to the cluster\n",
"\n",
"After adding GPU nodes to your cluster, you need to install NVIDIAs device drivers to the nodes. Google provides a DaemonSet that automatically installs the drivers for you.\n",
"\n",
"To deploy the installation DaemonSet, run the following command. It's an one-off work.\n",
"\n",
"```shell\n",
"kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml\n",
"```\n",
"\n",
"# Consume GPU via Kubeflow Pipelines SDK\n",
"\n",
"Here is a [document](https://www.kubeflow.org/docs/gke/pipelines/enable-gpu-and-tpu/).\n",
"\n",
"Following is a sample quick smoking test.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import kfp\n",
"from kfp import dsl\n",
"\n",
"def gpu_smoking_check_op():\n",
" return dsl.ContainerOp(\n",
" name='check',\n",
" image='tensorflow/tensorflow:latest-gpu',\n",
" command=['sh', '-c'],\n",
" arguments=['nvidia-smi']\n",
" ).set_gpu_limit(1)\n",
"\n",
"@dsl.pipeline(\n",
" name='GPU smoking check',\n",
" description='Smoking check whether GPU env is ready.'\n",
")\n",
"def gpu_pipeline():\n",
" gpu_smoking_check = gpu_smoking_check_op()\n",
"\n",
"if __name__ == '__main__':\n",
" kfp.compiler.Compiler().compile(gpu_pipeline, 'gpu_smoking_check.yaml')"
]
},
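{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compiling produces `gpu_smoking_check.yaml`, which you can upload and run from the Kubeflow Pipelines UI. Alternatively, here is a minimal sketch of submitting the pipeline function directly with `kfp.Client` (the host URL is an assumption; omit `host` when running this notebook inside the cluster, or pass your own endpoint):\n",
"\n",
"```python\n",
"client = kfp.Client()  # e.g. kfp.Client(host='http://localhost:8080') when port-forwarding the KFP UI\n",
"client.create_run_from_pipeline_func(gpu_pipeline, arguments={})\n",
"```"
]
},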
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may see warning message from Kubeflow Pipeline logs saying \"Insufficient nvidia.com/gpu\". Please wait for few minutes.\n",
"\n",
"If everything runs well, it's expected to see the results of \"nvidia-smi\" mentions the CUDA version, GPU type and usage etc.\n",
"\n",
"> You may also notice that after the pod got finished, the new GPU node is still there. GKE autoscale algorithm will free that node if no usage for certain time. More info is [here](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multiple GPUs pool in one cluster\n",
"\n",
"It's possible you want more then 1 type of GPU to be supported in one cluster.\n",
"\n",
"- There are several types of GPUs.\n",
"- Certain regions normally just support part of the GPUs ([document](https://cloud.google.com/compute/docs/gpus#gpus-list)).\n",
"\n",
"Since we can set \"--num-nodes=0\" for certain GPU node pool to save costs if no workload, we can create multiple node pools for different types of GPUs.\n",
"\n",
"## Add additional GPU nodes to your cluster\n",
"\n",
"\n",
"In upper section, we added a node pool for P100. Here we add another pool for V100.\n",
"\n",
"```shell\n",
"# You may customize these parameters.\n",
"export GPU_POOL_NAME=v100pool\n",
"export CLUSTER_NAME=existingClusterName\n",
"export CLUSTER_ZONE=us-west1-a\n",
"export GPU_TYPE=nvidia-tesla-v100\n",
"export GPU_COUNT=1\n",
"export MACHINE_TYPE=n1-highmem-8\n",
"\n",
"\n",
"# It may takes several minutes.\n",
"gcloud container node-pools create ${GPU_POOL_NAME} \\\n",
" --accelerator type=${GPU_TYPE},count=${GPU_COUNT} \\\n",
" --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} \\\n",
" --num-nodes=0 --machine-type=${MACHINE_TYPE} --min-nodes=0 --max-nodes=5 --enable-autoscaling\n",
"```\n",
"\n",
"## Consume certain GPU via Kubeflow Pipelines SDK\n",
"\n",
"Please reference following sample which explictlly request to use certain GPU."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import kfp\n",
"from kfp import dsl\n",
"\n",
"def gpu_p100_op():\n",
" return dsl.ContainerOp(\n",
" name='check_p100',\n",
" image='tensorflow/tensorflow:latest-gpu',\n",
" command=['sh', '-c'],\n",
" arguments=['nvidia-smi']\n",
" ).set_gpu_limit(1).add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p100')\n",
"\n",
"def gpu_v100_op():\n",
" return dsl.ContainerOp(\n",
" name='check_v100',\n",
" image='tensorflow/tensorflow:latest-gpu',\n",
" command=['sh', '-c'],\n",
" arguments=['nvidia-smi']\n",
" ).set_gpu_limit(1).add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')\n",
"\n",
"@dsl.pipeline(\n",
" name='GPU smoking check',\n",
" description='Smoking check whether GPU env is ready.'\n",
")\n",
"def gpu_pipeline():\n",
" gpu_p100 = gpu_p100_op()\n",
" gpu_v100 = gpu_v100_op()\n",
"\n",
"if __name__ == '__main__':\n",
" kfp.compiler.Compiler().compile(gpu_pipeline, 'gpu_smoking_check.yaml')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's expected it runs well and you will see different \"nvidia-smi\" logs from the two pipeline steps."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preemptible GPU\n",
"Preemptible GPU resource is more cheaper but it also means your task requires retries.\n",
"\n",
"Please notice the following only difference is that it added **--preemptible** and **--node-taints=preemptible=true:NoSchedule** parameters.\n",
"\n",
"```\n",
"export GPU_POOL_NAME=v100pool-preemptible\n",
"export CLUSTER_NAME=existingClusterName\n",
"export CLUSTER_ZONE=us-west1-a\n",
"export GPU_TYPE=nvidia-tesla-v100\n",
"export GPU_COUNT=1\n",
"export MACHINE_TYPE=n1-highmem-8\n",
"\n",
"gcloud container node-pools create ${GPU_POOL_NAME} \\\n",
" --accelerator type=${GPU_TYPE},count=${GPU_COUNT} \\\n",
" --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} \\\n",
" --preemptible \\\n",
" --node-taints=preemptible=true:NoSchedule \\\n",
" --num-nodes=0 --machine-type=${MACHINE_TYPE} --min-nodes=0 --max-nodes=5 --enable-autoscaling\n",
"```"
]
},
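{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the sample below, `gcp.use_preemptible_nodepool(hard_constraint=True)` makes the step tolerate the `preemptible=true:NoSchedule` taint and requires it to be scheduled onto the preemptible pool. Since a preemptible node can be reclaimed mid-run, it's also a good idea to make such steps retryable; here is a minimal sketch using the SDK's `set_retry` (the retry count is an arbitrary example):\n",
"\n",
"```python\n",
"op = gpu_v100_preemptible_op()\n",
"op.set_retry(3)  # retry up to 3 times, e.g. when the node is preempted\n",
"```"
]
},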
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import kfp\n",
"import kfp.gcp as gcp\n",
"from kfp import dsl\n",
"\n",
"def gpu_p100_op():\n",
" return dsl.ContainerOp(\n",
" name='check_p100',\n",
" image='tensorflow/tensorflow:latest-gpu',\n",
" command=['sh', '-c'],\n",
" arguments=['nvidia-smi']\n",
" ).set_gpu_limit(1).add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p100')\n",
"\n",
"def gpu_v100_op():\n",
" return dsl.ContainerOp(\n",
" name='check_v100',\n",
" image='tensorflow/tensorflow:latest-gpu',\n",
" command=['sh', '-c'],\n",
" arguments=['nvidia-smi']\n",
" ).set_gpu_limit(1).add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')\n",
"\n",
"def gpu_v100_preemptible_op():\n",
" v100_op = dsl.ContainerOp(\n",
" name='check_v100_preemptible',\n",
" image='tensorflow/tensorflow:latest-gpu',\n",
" command=['sh', '-c'],\n",
" arguments=['nvidia-smi'])\n",
" v100_op.set_gpu_limit(1)\n",
" v100_op.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')\n",
" v100_op.apply(gcp.use_preemptible_nodepool(hard_constraint=True))\n",
" return v100_op\n",
"\n",
"@dsl.pipeline(\n",
" name='GPU smoking check',\n",
" description='Smoking check whether GPU env is ready.'\n",
")\n",
"def gpu_pipeline():\n",
" gpu_p100 = gpu_p100_op()\n",
" gpu_v100 = gpu_v100_op()\n",
" gpu_v100_preemptible = gpu_v100_preemptible_op()\n",
"\n",
"if __name__ == '__main__':\n",
" kfp.compiler.Compiler().compile(gpu_pipeline, 'gpu_smoking_check.yaml')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TPU\n",
"Google's TPU is awesome. It's faster and lower TOC. To consume TPU, no need to create node-pool, just call KFP SDK to use it. Here is a [doc](https://www.kubeflow.org/docs/gke/pipelines/enable-gpu-and-tpu/#configure-containerop-to-consume-tpus). Please notice that not all regions has TPU yet.\n",
"\n"
]
},
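{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following is a minimal sketch of a pipeline step that requests a TPU via the KFP SDK (the TPU type, core count, TF version, and image are assumptions; pick values that match your workload and region, per the doc above):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import kfp\n",
"import kfp.gcp as gcp\n",
"from kfp import dsl\n",
"\n",
"def tpu_check_op():\n",
"    # Placeholder step; gcp.use_tpu adds the TPU resource request and TF version annotation.\n",
"    return dsl.ContainerOp(\n",
"        name='check_tpu',\n",
"        image='tensorflow/tensorflow:1.15.0',\n",
"        command=['sh', '-c'],\n",
"        arguments=['echo TPU step placeholder']\n",
"    ).apply(gcp.use_tpu(tpu_cores=8, tpu_resource='v2', tf_version='1.15'))\n",
"\n",
"@dsl.pipeline(\n",
"    name='TPU smoke check',\n",
"    description='Smoke check whether the TPU environment is ready.'\n",
")\n",
"def tpu_pipeline():\n",
"    tpu_check = tpu_check_op()\n",
"\n",
"if __name__ == '__main__':\n",
"    kfp.compiler.Compiler().compile(tpu_pipeline, 'tpu_smoke_check.yaml')"
]
},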
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}