pipelines/components/gcp/dataproc/create_cluster/sample.ipynb

252 lines
8.9 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Name\n",
"Data processing by creating a cluster in Cloud Dataproc\n",
"\n",
"\n",
"# Label\n",
"Cloud Dataproc, cluster, GCP, Cloud Storage, KubeFlow, Pipeline\n",
"\n",
"\n",
"# Summary\n",
"A Kubeflow Pipeline component to create a cluster in Cloud Dataproc.\n",
"\n",
"# Details\n",
"## Intended use\n",
"\n",
"Use this component at the start of a Kubeflow Pipeline to create a temporary Cloud Dataproc cluster to run Cloud Dataproc jobs as steps in the pipeline.\n",
"\n",
"## Runtime arguments\n",
"\n",
"| Argument | Description | Optional | Data type | Accepted values | Default |\n",
"|----------|-------------|----------|-----------|-----------------|---------|\n",
"| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | |\n",
"| region | The Cloud Dataproc region to create the cluster in. | No | GCPRegion | | |\n",
"| name | The name of the cluster. Cluster names within a project must be unique. You can reuse the names of deleted clusters. | Yes | String | | None |\n",
"| name_prefix | The prefix of the cluster name. | Yes | String | | None |\n",
"| initialization_actions | A list of Cloud Storage URIs identifying executables to execute on each node after the configuration is completed. By default, executables are run on the master and all the worker nodes. | Yes | List | | None |\n",
"| config_bucket | The Cloud Storage bucket to use to stage the job dependencies, the configuration files, and the job driver consoles output. | Yes | GCSPath | | None |\n",
"| image_version | The version of the software inside the cluster. | Yes | String | | None |\n",
"| cluster | The full [cluster configuration](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster). | Yes | Dict | | None |\n",
"| wait_interval | The number of seconds to pause before polling the operation. | Yes | Integer | | 30 |\n",
"\n",
"## Output\n",
"Name | Description | Type\n",
":--- | :---------- | :---\n",
"cluster_name | The name of the cluster. | String\n",
"\n",
"Note: You can recycle the cluster by using the [Dataproc delete cluster component](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/delete_cluster).\n",
"\n",
"\n",
"## Cautions & requirements\n",
"\n",
"To use the component, you must:\n",
"* Set up the GCP project by following these [steps](https://cloud.google.com/dataproc/docs/guides/setup-project).\n",
"* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n",
"\n",
" ```\n",
" component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n",
" ```\n",
"* Grant the following types of access to the Kubeflow user service account:\n",
" * Read access to the Cloud Storage buckets which contains initialization action files.\n",
" * The role, `roles/dataproc.editor` on the project.\n",
"\n",
"## Detailed description\n",
"\n",
"This component creates a new Dataproc cluster by using the [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create). \n",
"\n",
"Follow these steps to use the component in a pipeline:\n",
"\n",
"1. Install the Kubeflow Pipeline SDK:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"\n",
"KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.14/kfp.tar.gz'\n",
"!pip3 install $KFP_PACKAGE --upgrade"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Load the component using KFP SDK"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import kfp.components as comp\n",
"\n",
"dataproc_create_cluster_op = comp.load_component_from_url(\n",
" 'https://raw.githubusercontent.com/kubeflow/pipelines/e598176c02f45371336ccaa819409e8ec83743df/components/gcp/dataproc/create_cluster/component.yaml')\n",
"help(dataproc_create_cluster_op)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sample\n",
"Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n",
"\n",
"#### Set sample parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"# Required Parameters\n",
"PROJECT_ID = '<Please put your project ID here>'\n",
"\n",
"# Optional Parameters\n",
"EXPERIMENT_NAME = 'Dataproc - Create Cluster'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example pipeline that uses the component"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import kfp.dsl as dsl\n",
"import kfp.gcp as gcp\n",
"import json\n",
"@dsl.pipeline(\n",
" name='Dataproc create cluster pipeline',\n",
" description='Dataproc create cluster pipeline'\n",
")\n",
"def dataproc_create_cluster_pipeline(\n",
" project_id = PROJECT_ID, \n",
" region = 'us-central1', \n",
" name='', \n",
" name_prefix='',\n",
" initialization_actions='', \n",
" config_bucket='', \n",
" image_version='', \n",
" cluster='', \n",
" wait_interval='30'\n",
"):\n",
" dataproc_create_cluster_op(\n",
" project_id=project_id, \n",
" region=region, \n",
" name=name, \n",
" name_prefix=name_prefix, \n",
" initialization_actions=initialization_actions, \n",
" config_bucket=config_bucket, \n",
" image_version=image_version, \n",
" cluster=cluster, \n",
" wait_interval=wait_interval).apply(gcp.use_gcp_secret('user-gcp-sa'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Compile the pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_func = dataproc_create_cluster_pipeline\n",
"pipeline_filename = pipeline_func.__name__ + '.zip'\n",
"import kfp.compiler as compiler\n",
"compiler.Compiler().compile(pipeline_func, pipeline_filename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit the pipeline for execution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Specify pipeline argument values\n",
"arguments = {}\n",
"\n",
"#Get or create an experiment and submit a pipeline run\n",
"import kfp\n",
"client = kfp.Client()\n",
"experiment = client.create_experiment(EXPERIMENT_NAME)\n",
"\n",
"#Submit a pipeline run\n",
"run_name = pipeline_func.__name__ + ' run'\n",
"run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"* [Kubernetes Engine for Kubeflow](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts)\n",
"* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py)\n",
"* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n",
"* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/create_cluster/sample.ipynb)\n",
"* [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create)\n",
"\n",
"## License\n",
"By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}