pipelines/components/gcp/bigquery/query/sample.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Name\n",
"\n",
"Gather training data by querying BigQuery \n",
"\n",
"\n",
"# Labels\n",
"\n",
"GCP, BigQuery, Kubeflow, Pipeline\n",
"\n",
"\n",
"# Summary\n",
"\n",
"A Kubeflow Pipeline component to submit a query to BigQuery and store the result in a Cloud Storage bucket.\n",
"\n",
"\n",
"# Details\n",
"\n",
"\n",
"## Intended use\n",
"\n",
"Use this Kubeflow component to:\n",
"* Select training data by submitting a query to BigQuery.\n",
"* Output the training data into a Cloud Storage bucket as CSV files.\n",
"\n",
"\n",
"## Runtime arguments:\n",
"\n",
"\n",
"| Argument | Description | Optional | Data type | Accepted values | Default |\n",
"|----------|-------------|----------|-----------|-----------------|---------|\n",
"| query | The query used by BigQuery to fetch the results. | No | String | | |\n",
"| project_id | The project ID of the Google Cloud Platform (GCP) project to use to execute the query. | No | GCPProjectID | | |\n",
"| dataset_id | The ID of the persistent BigQuery dataset to store the results of the query. If the dataset does not exist, the operation will create a new one. | Yes | String | | None |\n",
"| table_id | The ID of the BigQuery table to store the results of the query. If the table ID is absent, the operation will generate a random ID for the table. | Yes | String | | None |\n",
"| output_gcs_path | The path to the Cloud Storage bucket to store the query output. | Yes | GCSPath | | None |\n",
"| dataset_location | The location where the dataset is created. Defaults to US. | Yes | String | | US |\n",
"| job_config | The full configuration specification for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Yes | Dict | A JSONobject which has the same structure as [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) | None |\n",
"## Input data schema\n",
"\n",
"The input data is a BigQuery job containing a query that pulls data f rom various sources. \n",
"\n",
"\n",
"## Output:\n",
"\n",
"Name | Description | Type\n",
":--- | :---------- | :---\n",
"output_gcs_path | The path to the Cloud Storage bucket containing the query output in CSV format. | GCSPath\n",
"\n",
"## Cautions & requirements\n",
"\n",
"To use the component, the following requirements must be met:\n",
"\n",
"* The BigQuery API is enabled.\n",
"* The component can authenticate to use GCP APIs. Refer to [Authenticating Pipelines to GCP](https://www.kubeflow.org/docs/gke/authentication-pipelines/) for details.\n",
"* The Kubeflow user service account is a member of the `roles/bigquery.admin` role of the project.\n",
"* The Kubeflow user service account is a member of the `roles/storage.objectCreator `role of the Cloud Storage output bucket.\n",
"\n",
"## Detailed description\n",
"This Kubeflow Pipeline component is used to:\n",
"* Submit a query to BigQuery.\n",
" * The query results are persisted in a dataset table in BigQuery.\n",
" * An extract job is created in BigQuery to extract the data from the dataset table and output it to a Cloud Storage bucket as CSV files.\n",
"\n",
" Use the code below as an example of how to run your BigQuery job.\n",
"\n",
"### Sample\n",
"\n",
"Note: The following sample code works in an IPython notebook or directly in Python code.\n",
"\n",
"#### Set sample parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"\n",
"!pip3 install kfp --upgrade"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Load the component using KFP SDK"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import kfp.components as comp\n",
"\n",
"bigquery_query_op = comp.load_component_from_url(\n",
" 'https://raw.githubusercontent.com/kubeflow/pipelines/01a23ae8672d3b18e88adf3036071496aca3552d/components/gcp/bigquery/query/component.yaml')\n",
"help(bigquery_query_op)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sample\n",
"\n",
"Note: The following sample code works in IPython notebook or directly in Python code.\n",
"\n",
"In this sample, we send a query to get the top questions from stackdriver public data and output the data to a Cloud Storage bucket. Here is the query:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"QUERY = 'SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions` LIMIT 10'"
]
},
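{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally (an extra check, not part of the original sample), you can preview a few rows of the query locally before wiring it into the pipeline. This sketch assumes the `google-cloud-bigquery` package is installed and that application default credentials with a default project are available in the notebook environment:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional preview of the query (assumes google-cloud-bigquery and default credentials).\n",
"from google.cloud import bigquery\n",
"\n",
"bq_client = bigquery.Client()  # uses the default project from the environment\n",
"for row in bq_client.query(QUERY).result(max_results=5):\n",
"    print(dict(row))"
]
},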
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Set sample parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"# Required Parameters\n",
"PROJECT_ID = '<Please put your project ID here>'\n",
"GCS_WORKING_DIR = 'gs://<Please put your GCS path here>' # No ending slash"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional Parameters\n",
"EXPERIMENT_NAME = 'Bigquery -Query'\n",
"OUTPUT_PATH = '{}/bigquery/query/questions.csv'.format(GCS_WORKING_DIR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Run the component as a single pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import kfp.dsl as dsl\n",
"import json\n",
"@dsl.pipeline(\n",
" name='Bigquery query pipeline',\n",
" description='Bigquery query pipeline'\n",
")\n",
"def pipeline(\n",
" query=QUERY, \n",
" project_id = PROJECT_ID, \n",
" dataset_id='', \n",
" table_id='', \n",
" output_gcs_path=OUTPUT_PATH, \n",
" dataset_location='US', \n",
" job_config=''\n",
"):\n",
" bigquery_query_op(\n",
" query=query, \n",
" project_id=project_id, \n",
" dataset_id=dataset_id, \n",
" table_id=table_id, \n",
" output_gcs_path=output_gcs_path, \n",
" dataset_location=dataset_location, \n",
" job_config=job_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Compile the pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_func = pipeline\n",
"pipeline_filename = pipeline_func.__name__ + '.zip'\n",
"import kfp.compiler as compiler\n",
"compiler.Compiler().compile(pipeline_func, pipeline_filename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit the pipeline for execution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Specify pipeline argument values\n",
"arguments = {}\n",
"\n",
"#Get or create an experiment and submit a pipeline run\n",
"import kfp\n",
"client = kfp.Client()\n",
"experiment = client.create_experiment(EXPERIMENT_NAME)\n",
"\n",
"#Submit a pipeline run\n",
"run_name = pipeline_func.__name__ + ' run'\n",
"run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Inspect the output"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!gsutil cat $OUTPUT_PATH"
]
},
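{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional extra (not part of the original sample), you can also load the output into a pandas DataFrame for a quick look. This assumes `pandas` and `gcsfs` are installed in the notebook environment:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: load the CSV output into a DataFrame (assumes pandas and gcsfs are installed).\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv(OUTPUT_PATH)\n",
"df.head()"
]
},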
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"* [Component python code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/bigquery/_query.py)\n",
"* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n",
"* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)\n",
"* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)\n",
"\n",
"## License\n",
"By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}