{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Name\n",
"Data preparation using Apache Pig on YARN with Cloud Dataproc\n",
"\n",
"# Label\n",
"Cloud Dataproc, GCP, Cloud Storage, YARN, Pig, Apache, Kubeflow, pipelines, components\n",
"\n",
"\n",
"# Summary\n",
"A Kubeflow Pipeline component to prepare data by submitting an Apache Pig job on YARN to Cloud Dataproc.\n",
"\n",
"\n",
"# Details\n",
"## Intended use\n",
"Use the component to run an Apache Pig job as one preprocessing step in a Kubeflow Pipeline.\n",
"\n",
"## Runtime arguments\n",
"| Argument | Description | Optional | Data type | Accepted values | Default |\n",
"|----------|-------------|----------|-----------|-----------------|---------|\n",
"| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | | |\n",
"| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | |\n",
"| cluster_name | The name of the cluster to run the job. | No | String | | |\n",
"| queries | The queries to execute the Pig job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None |\n",
"| query_file_uri | The HCFS URI of the script that contains the Pig queries. | Yes | GCSPath | | None |\n",
"| script_variables | Mapping of the querys variable names to their values (equivalent to the Pig command: SET name=\"value\";). | Yes | Dict | | None |\n",
"| pig_job | The payload of a [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob). | Yes | Dict | | None |\n",
"| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n",
"| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |\n",
"\n",
"## Output\n",
"Name | Description | Type\n",
":--- | :---------- | :---\n",
"job_id | The ID of the created job. | String\n",
"\n",
"## Cautions & requirements\n",
"\n",
"To use the component, you must:\n",
"* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n",
"* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n",
"* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n",
"\n",
" ```\n",
" component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n",
" ```\n",
"* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n",
"\n",
"## Detailed description\n",
"This component creates a Pig job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n",
"\n",
"Follow these steps to use the component in a pipeline:\n",
"1. Install the Kubeflow Pipeline SDK:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"\n",
"KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.14/kfp.tar.gz'\n",
"!pip3 install $KFP_PACKAGE --upgrade"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Load the component using KFP SDK"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import kfp.components as comp\n",
"\n",
"dataproc_submit_pig_job_op = comp.load_component_from_url(\n",
" 'https://raw.githubusercontent.com/kubeflow/pipelines/e598176c02f45371336ccaa819409e8ec83743df/components/gcp/dataproc/submit_pig_job/component.yaml')\n",
"help(dataproc_submit_pig_job_op)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sample\n",
"\n",
"Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n",
"\n",
"\n",
"#### Setup a Dataproc cluster\n",
"\n",
"[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n",
"\n",
"\n",
"#### Prepare a Pig query\n",
"\n",
"Either put your Pig queries in the `queries` list, or upload your Pig queries into a file to a Cloud Storage bucket and then enter the Cloud Storage buckets path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a local `passwd` file.\n",
"\n",
"For more details on Apache Pig, see the [Pig documentation.](http://pig.apache.org/docs/latest/)\n",
"\n",
"#### Set sample parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"PROJECT_ID = '<Please put your project ID here>'\n",
"CLUSTER_NAME = '<Please put your existing cluster name here>'\n",
"\n",
"REGION = 'us-central1'\n",
"QUERY = '''\n",
"natality_csv = load 'gs://public-datasets/natality/csv' using PigStorage(':');\n",
"top_natality_csv = LIMIT natality_csv 10; \n",
"dump natality_csv;'''\n",
"EXPERIMENT_NAME = 'Dataproc - Submit Pig Job'"
]
},
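{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to the `queries` list, you can pass the entire job payload through `pig_job`. The cell below is a minimal sketch of such a payload, not part of the sample pipeline; the field names follow the [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob) REST schema, and the values are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Illustrative PigJob payload; equivalent to passing QUERY through `queries`.\n",
"# Use `queryFileUri` instead of `queryList` to reference a script in Cloud Storage.\n",
"# `scriptVariables` corresponds to Pig's `SET name=value;` statements.\n",
"SAMPLE_PIG_JOB = json.dumps({\n",
"    'queryList': {'queries': [QUERY]},\n",
"    'scriptVariables': {'ROW_LIMIT': '10'},\n",
"    'continueOnFailure': False\n",
"})"
]
},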
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example pipeline that uses the component"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import kfp.dsl as dsl\n",
"import kfp.gcp as gcp\n",
"import json\n",
"@dsl.pipeline(\n",
" name='Dataproc submit Pig job pipeline',\n",
" description='Dataproc submit Pig job pipeline'\n",
")\n",
"def dataproc_submit_pig_job_pipeline(\n",
" project_id = PROJECT_ID, \n",
" region = REGION,\n",
" cluster_name = CLUSTER_NAME,\n",
" queries = json.dumps([QUERY]),\n",
" query_file_uri = '',\n",
" script_variables = '', \n",
" pig_job='', \n",
" job='', \n",
" wait_interval='30'\n",
"):\n",
" dataproc_submit_pig_job_op(\n",
" project_id=project_id, \n",
" region=region, \n",
" cluster_name=cluster_name, \n",
" queries=queries, \n",
" query_file_uri=query_file_uri,\n",
" script_variables=script_variables, \n",
" pig_job=pig_job, \n",
" job=job, \n",
" wait_interval=wait_interval).apply(gcp.use_gcp_secret('user-gcp-sa'))\n",
" "
]
},
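{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that list and dict arguments such as `queries` and `script_variables` are passed as JSON-encoded strings, because pipeline parameters are serialized as strings. As a sketch (the variable name `ROW_LIMIT` is illustrative), a dict argument is encoded the same way the pipeline above encodes the `queries` list:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Illustrative only: encode a dict argument the same way `queries` encodes a list.\n",
"sample_script_variables = json.dumps({'ROW_LIMIT': '10'})\n",
"print(sample_script_variables)"
]
},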
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Compile the pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_func = dataproc_submit_pig_job_pipeline\n",
"pipeline_filename = pipeline_func.__name__ + '.zip'\n",
"import kfp.compiler as compiler\n",
"compiler.Compiler().compile(pipeline_func, pipeline_filename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit the pipeline for execution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Specify pipeline argument values\n",
"arguments = {}\n",
"\n",
"#Get or create an experiment and submit a pipeline run\n",
"import kfp\n",
"client = kfp.Client()\n",
"experiment = client.create_experiment(EXPERIMENT_NAME)\n",
"\n",
"#Submit a pipeline run\n",
"run_name = pipeline_func.__name__ + ' run'\n",
"run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)"
]
},
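{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can block until the run finishes. The sketch below uses the KFP client's `wait_for_run_completion`; attribute names may vary across KFP SDK versions, and the timeout value is arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: wait for the run to complete (timeout in seconds is arbitrary).\n",
"run_detail = client.wait_for_run_completion(run_result.id, timeout=1800)\n",
"print(run_detail.run.status)"
]
},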
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"* [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) \n",
"* [Pig documentation](http://pig.apache.org/docs/latest/)\n",
"* [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs)\n",
"* [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob)\n",
"\n",
"## License\n",
"By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}