# Name
Batch prediction using Cloud Machine Learning Engine
# Label
Cloud Storage, Cloud ML Engine, Kubeflow, Pipeline, Component
# Summary
A Kubeflow Pipeline component to submit a batch prediction job against a deployed model on Cloud ML Engine.
# Details
## Intended use
Use the component to run a batch prediction job against a deployed model on Cloud ML Engine. The prediction output is stored in a Cloud Storage bucket.
## Runtime arguments
| Argument | Description | Optional | Data type | Accepted values | Default |
|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------------|-----------------|---------|
| project_id | The ID of the Google Cloud Platform (GCP) project of the job. | No | GCPProjectID | | |
| model_path | The path to the model. It can be one of the following:<br/> <ul> <li>projects/[PROJECT_ID]/models/[MODEL_ID]</li> <li>projects/[PROJECT_ID]/models/[MODEL_ID]/versions/[VERSION_ID]</li> <li>The path to a Cloud Storage location containing a model file.</li> </ul> | No | GCSPath | | |
| input_paths | The path to the Cloud Storage location containing the input data files. It can contain wildcards, for example, `gs://foo/*.csv` | No | List | GCSPath | |
| input_data_format | The format of the input data files. See [REST Resource: projects.jobs](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat) for more details. | No | String | DataFormat | |
| output_path | The path to the Cloud Storage location for the output data. | No | GCSPath | | |
| region | The Compute Engine region where the prediction job is run. | No | GCPRegion | | |
| output_data_format | The format of the output data files. See [REST Resource: projects.jobs](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat) for more details. | Yes | String | DataFormat | JSON |
| prediction_input | The JSON input parameters to create a prediction job. See [PredictionInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#PredictionInput) for more information. | Yes | Dict | | None |
| job_id_prefix | The prefix of the generated job id. | Yes | String | | None |
| wait_interval | The number of seconds to wait between checks of the job status when the operation has a long run time. | Yes | Integer | | 30 |
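As the sample pipeline below illustrates, `List` and `Dict` arguments are passed to the component as JSON-serialized strings. A minimal sketch with placeholder values (the bucket path and runtime version here are not the sample data used later):
```python
import json

# List arguments (for example, input_paths) are passed as a JSON-serialized list,
# and Dict arguments (for example, prediction_input) as a JSON-serialized object.
input_paths = json.dumps(['gs://my-bucket/batch_inputs/*.json'])  # placeholder path
prediction_input = json.dumps({'runtimeVersion': '1.15'})         # placeholder version
```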
## Input data schema
The component accepts the following as input:
* A trained model: It can be a model file in Cloud Storage, a deployed model, or a version in Cloud ML Engine. Specify the path to the model in the `model_path` runtime argument.
* Input data: The data used to make predictions against the trained model. The data can be in [multiple formats](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat). The data path is specified by `input_paths` and the format by `input_data_format`; a sketch of a JSON-format input file follows this list.
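For JSON input data, the file is newline-delimited: each line is one JSON instance. A minimal sketch that writes such a file (the feature names are hypothetical and do not match the census sample used later):
```python
# Write a hypothetical newline-delimited JSON input file: one instance per line.
with open('instances.json', 'w') as f:
    f.write('{"age": 25, "workclass": "Private"}\n')
    f.write('{"age": 42, "workclass": "Self-emp-inc"}\n')
```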
## Output
Name | Description | Type
:--- | :---------- | :---
job_id | The ID of the created batch job. | String
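Inside a pipeline function, the `job_id` output can be wired into downstream steps through the task's `outputs` dictionary. A minimal sketch (the downstream component `log_job_op` and the parameter values are hypothetical):
```python
# Sketch: pass the batch prediction job ID to a hypothetical downstream component.
predict_task = mlengine_batch_predict_op(
    project_id=project_id,
    model_path=model_path,
    input_paths=input_paths,
    input_data_format=input_data_format,
    output_path=output_path,
    region=region)
log_job_op(job_id=predict_task.outputs['job_id'])  # log_job_op is hypothetical
```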
## Cautions & requirements
To use the component, you must:
* Set up a cloud environment by following this [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup).
* Ensure that the component can authenticate to GCP. Refer to [Authenticating Pipelines to GCP](https://www.kubeflow.org/docs/gke/authentication-pipelines/) for details.
* Grant the following types of access to the Kubeflow user service account (a `gsutil` sketch follows this list):
    * Read access to the Cloud Storage bucket that contains the input data.
    * Write access to the Cloud Storage bucket that holds the output directory.
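A sketch of how this access could be granted with `gsutil`; the service account email and bucket names below are placeholders:
```python
# Placeholders: substitute your Kubeflow user service account and bucket names.
# Read access to the input bucket:
!gsutil iam ch serviceAccount:<kubeflow-user-sa>@<project-id>.iam.gserviceaccount.com:roles/storage.objectViewer gs://<input-bucket>
# Write access to the output bucket:
!gsutil iam ch serviceAccount:<kubeflow-user-sa>@<project-id>.iam.gserviceaccount.com:roles/storage.objectAdmin gs://<output-bucket>
```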
## Detailed description
Follow these steps to use the component in a pipeline:
1. Install the Kubeflow Pipelines SDK:
```python
%%capture --no-stderr
!pip3 install kfp --upgrade
```
2. Load the component using the KFP SDK:
```python
import kfp.components as comp
mlengine_batch_predict_op = comp.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/1.4.0-rc.1/components/gcp/ml_engine/batch_predict/component.yaml')
help(mlengine_batch_predict_op)
```
### Sample Code
Note: The following sample code works in an IPython notebook or directly in Python code. The lines that start with `!` or `%%` are IPython magics and require a notebook environment.
In this sample, you batch predict against a pre-built trained model from `gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/` and use the test data from `gs://ml-pipeline-playground/samples/ml_engine/census/test.json`.
#### Inspect the test data
```python
!gsutil cat gs://ml-pipeline-playground/samples/ml_engine/census/test.json
```
#### Set sample parameters
```python
# Required Parameters
PROJECT_ID = '<Please put your project ID here>'
GCS_WORKING_DIR = 'gs://<Please put your GCS path here>' # No ending slash
```
```python
# Optional Parameters
EXPERIMENT_NAME = 'CLOUDML - Batch Predict'
OUTPUT_GCS_PATH = GCS_WORKING_DIR + '/batch_predict/output/'
```
#### Example pipeline that uses the component
```python
import kfp.dsl as dsl
import json


@dsl.pipeline(
    name='CloudML batch predict pipeline',
    description='CloudML batch predict pipeline'
)
def pipeline(
    project_id = PROJECT_ID,
    model_path = 'gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/',
    input_paths = '["gs://ml-pipeline-playground/samples/ml_engine/census/test.json"]',
    input_data_format = 'JSON',
    output_path = OUTPUT_GCS_PATH,
    region = 'us-central1',
    output_data_format='',
    prediction_input = json.dumps({
        'runtimeVersion': '1.10'
    }),
    job_id_prefix='',
    wait_interval='30'):
    mlengine_batch_predict_op(
        project_id=project_id,
        model_path=model_path,
        input_paths=input_paths,
        input_data_format=input_data_format,
        output_path=output_path,
        region=region,
        output_data_format=output_data_format,
        prediction_input=prediction_input,
        job_id_prefix=job_id_prefix,
        wait_interval=wait_interval)
```
#### Compile the pipeline
```python
pipeline_func = pipeline
pipeline_filename = pipeline_func.__name__ + '.zip'
import kfp.compiler as compiler
compiler.Compiler().compile(pipeline_func, pipeline_filename)
```
#### Submit the pipeline for execution
```python
# Specify pipeline argument values
arguments = {}

# Get or create an experiment and submit a pipeline run
import kfp
client = kfp.Client()
experiment = client.create_experiment(EXPERIMENT_NAME)

# Submit a pipeline run
run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)
```
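Optionally, block until the run finishes before inspecting the results. A minimal sketch using the same `client`; the timeout is in seconds, so adjust it to your job's expected duration:
```python
# Wait for the submitted run to complete before reading the output files.
client.wait_for_run_completion(run_result.run_id, timeout=3600)
```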
#### Inspect prediction results
```python
OUTPUT_FILES_PATTERN = OUTPUT_GCS_PATH + '*'
!gsutil cat $OUTPUT_FILES_PATTERN
```
## References
* [Component python code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/ml_engine/_batch_predict.py)
* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)
* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/batch_predict/sample.ipynb)
* [Cloud Machine Learning Engine job REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs)
## License
By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control.