# Name
Gather training data by querying BigQuery
# Labels
GCP, BigQuery, Kubeflow, Pipeline
# Summary
A Kubeflow Pipeline component to submit a query to BigQuery and store the result in a Cloud Storage bucket.
# Details
## Intended use
Use this Kubeflow component to:
* Select training data by submitting a query to BigQuery.
* Output the training data into a Cloud Storage bucket as CSV files.
## Runtime arguments:
| Argument | Description | Optional | Data type | Accepted values | Default |
|----------|-------------|----------|-----------|-----------------|---------|
| query | The query used by BigQuery to fetch the results. | No | String | | |
| project_id | The project ID of the Google Cloud Platform (GCP) project to use to execute the query. | No | GCPProjectID | | |
| dataset_id | The ID of the persistent BigQuery dataset to store the results of the query. If the dataset does not exist, the operation will create a new one. | Yes | String | | None |
| table_id | The ID of the BigQuery table to store the results of the query. If the table ID is absent, the operation will generate a random ID for the table. | Yes | String | | None |
| output_gcs_path | The path to the Cloud Storage bucket to store the query output. | Yes | GCSPath | | None |
| dataset_location | The location where the dataset is created. Defaults to US. | Yes | String | | US |
| job_config | The full configuration specification for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Yes | Dict | A JSON object with the same structure as [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig); see the example after this table. | None |
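For example, a `job_config` that forces standard SQL and overwrites the destination table might look like the sketch below. The nesting follows the BigQuery REST API's `configuration.query` resource; whether the component expects exactly this nesting is an assumption, so check the linked QueryJobConfig docs and the component code if the settings are not picked up.
```python
# Sketch of a job_config value. Field names come from the BigQuery REST API
# (configuration.query); the exact nesting expected by the component is an
# assumption -- verify against the component's Python code under References.
job_config = {
    'query': {
        'useLegacySql': False,                 # run the query as standard SQL
        'writeDisposition': 'WRITE_TRUNCATE'   # overwrite the destination table if it exists
    }
}
```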
## Input data schema
The input data is a BigQuery job containing a query that pulls data from various sources.
## Output:
Name | Description | Type
:--- | :---------- | :---
output_gcs_path | The path to the Cloud Storage bucket containing the query output in CSV format (see the usage sketch after this table). | GCSPath
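In a pipeline, this output can be passed to downstream steps. A minimal sketch, where `train_op` is a hypothetical downstream component that accepts a Cloud Storage path (it is not part of this repository):
```python
# Hypothetical sketch: feed the query output into a downstream step.
# 'train_op' is a placeholder component; QUERY, PROJECT_ID, and OUTPUT_PATH
# are defined as in the sample further below.
query_task = bigquery_query_op(
    query=QUERY,
    project_id=PROJECT_ID,
    output_gcs_path=OUTPUT_PATH).apply(gcp.use_gcp_secret('user-gcp-sa'))

train_task = train_op(training_data=query_task.outputs['output_gcs_path'])
```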
## Cautions & requirements
To use the component, the following requirements must be met:
* The BigQuery API is enabled.
* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipeline cluster. For example:
```
bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))
```
* The Kubeflow user service account is a member of the `roles/bigquery.admin` role of the project.
* The Kubeflow user service account is a member of the `roles/storage.objectCreator` role of the Cloud Storage output bucket. An example of granting both roles is shown after this list.
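The role bindings can be granted ahead of time with `gcloud` and `gsutil`. A minimal sketch, using placeholder values for the project, service account, and bucket (run from a notebook, like the samples below):
```python
# Placeholders -- replace with your own project, service account, and bucket.
PROJECT_ID = '<your-project-id>'
SA_EMAIL = '<your-kubeflow-user-sa>@<your-project-id>.iam.gserviceaccount.com'
BUCKET = 'gs://<your-output-bucket>'

# Grant BigQuery admin on the project and objectCreator on the output bucket.
!gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:$SA_EMAIL --role roles/bigquery.admin
!gsutil iam ch serviceAccount:$SA_EMAIL:objectCreator $BUCKET
```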
## Detailed description
This Kubeflow Pipeline component is used to:
* Submit a query to BigQuery.
* Persist the query results in a dataset table in BigQuery.
* Run an extract job in BigQuery to export the table to a Cloud Storage bucket as CSV files.
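For reference, the same flow can be approximated directly with the `google-cloud-bigquery` client library. This is only a simplified sketch of what the component automates, assuming a recent client library version; the project, dataset, table, and bucket names are placeholders, and the component's actual implementation is linked under References.
```python
# Simplified sketch of the query-then-extract flow using google-cloud-bigquery.
# All names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project='my-project')

# 1. Run the query and persist the result into a destination table.
job_config = bigquery.QueryJobConfig(destination='my-project.my_dataset.my_table')
client.query('SELECT 1 AS x', job_config=job_config).result()

# 2. Extract the destination table to Cloud Storage as CSV files.
client.extract_table('my-project.my_dataset.my_table',
                     'gs://my-bucket/bigquery/query/output-*.csv').result()
```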
Use the code below as an example of how to run your BigQuery job.
Follow these steps to use the component in a pipeline:
1. Install the Kubeflow Pipeline SDK:
```python
%%capture --no-stderr
KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.14/kfp.tar.gz'
!pip3 install $KFP_PACKAGE --upgrade
```
2. Load the component using KFP SDK
```python
import kfp.components as comp
bigquery_query_op = comp.load_component_from_url(
'https://raw.githubusercontent.com/kubeflow/pipelines/e598176c02f45371336ccaa819409e8ec83743df/components/gcp/bigquery/query/component.yaml')
help(bigquery_query_op)
```
### Sample
Note: The following sample code works in an IPython notebook or directly in Python code.
In this sample, we send a query to get the top questions from the Stack Overflow public dataset and output the result to a Cloud Storage bucket. Here is the query:
```python
QUERY = 'SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions` LIMIT 10'
```
#### Set sample parameters
```python
# Required Parameters
PROJECT_ID = '<Please put your project ID here>'
GCS_WORKING_DIR = 'gs://<Please put your GCS path here>' # No ending slash
```
```python
# Optional Parameters
EXPERIMENT_NAME = 'Bigquery-Query'
OUTPUT_PATH = '{}/bigquery/query/questions.csv'.format(GCS_WORKING_DIR)
```
#### Run the component as a single pipeline
```python
import kfp.dsl as dsl
import kfp.gcp as gcp
import json
@dsl.pipeline(
    name='Bigquery query pipeline',
    description='Bigquery query pipeline'
)
def pipeline(
    query=QUERY,
    project_id=PROJECT_ID,
    dataset_id='',
    table_id='',
    output_gcs_path=OUTPUT_PATH,
    dataset_location='US',
    job_config=''
):
    bigquery_query_op(
        query=query,
        project_id=project_id,
        dataset_id=dataset_id,
        table_id=table_id,
        output_gcs_path=output_gcs_path,
        dataset_location=dataset_location,
        job_config=job_config).apply(gcp.use_gcp_secret('user-gcp-sa'))
```
#### Compile the pipeline
```python
pipeline_func = pipeline
pipeline_filename = pipeline_func.__name__ + '.zip'
import kfp.compiler as compiler
compiler.Compiler().compile(pipeline_func, pipeline_filename)
```
#### Submit the pipeline for execution
```python
#Specify pipeline argument values
arguments = {}
#Get or create an experiment and submit a pipeline run
import kfp
client = kfp.Client()
experiment = client.create_experiment(EXPERIMENT_NAME)
#Submit a pipeline run
run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)
```
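Before inspecting the output, you may want to wait for the run to finish. A minimal sketch, assuming the KFP client used above exposes `wait_for_run_completion` (check your SDK version):
```python
# Block until the run finishes or the timeout (in seconds) expires.
# run_result.id is the run ID returned by run_pipeline above.
client.wait_for_run_completion(run_result.id, timeout=1200)
```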
#### Inspect the output
```python
!gsutil cat $OUTPUT_PATH
```
## References
* [Component python code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/bigquery/_query.py)
* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)
* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)
* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)
## License
By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control.