# Name
Batch prediction using Cloud Machine Learning Engine

# Label
Cloud Storage, Cloud ML Engine, Kubeflow, Pipeline, Component

# Summary
A Kubeflow Pipeline component to submit a batch prediction job against a deployed model on Cloud ML Engine.

# Details

## Intended use
Use the component to run a batch prediction job against a deployed model on Cloud ML Engine. The prediction output is stored in a Cloud Storage bucket.

## Runtime arguments

| Argument | Description | Optional | Data type | Accepted values | Default |
|--------------------|-------------|----------|--------------|-----------------|---------|
| project_id | The ID of the Google Cloud Platform (GCP) project of the job. | No | GCPProjectID | | |
| model_path | The path to the model. It can be the Cloud Storage path to a model file, the name of a deployed model, or the name of a deployed version in Cloud ML Engine. | No | GCSPath | | |
| input_paths | The path to the Cloud Storage location containing the input data files. It can contain wildcards, for example, `gs://foo/*.csv`. | No | List | GCSPath | |
| input_data_format | The format of the input data files. See [REST Resource: projects.jobs](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat) for more details. | No | String | DataFormat | |
| output_path | The path to the Cloud Storage location for the output data. | No | GCSPath | | |
| region | The Compute Engine region where the prediction job is run. | No | GCPRegion | | |
| output_data_format | The format of the output data files. See [REST Resource: projects.jobs](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat) for more details. | Yes | String | DataFormat | JSON |
| prediction_input | The JSON input parameters to create a prediction job. See [PredictionInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#PredictionInput) for more information; a brief sketch follows the table. | Yes | Dict | | None |
| job_id_prefix | The prefix of the generated job ID. | Yes | String | | None |
| wait_interval | The number of seconds to wait between status checks when the job runs for a long time. | Yes | Integer | | 30 |
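For example, a `prediction_input` argument might be built as in the sketch below. The keys follow the [PredictionInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#PredictionInput) REST resource; the particular values shown are illustrative assumptions, not defaults.

```python
import json

# Sketch of a prediction_input payload. The keys follow the PredictionInput
# REST resource; the values below are examples, not defaults.
prediction_input = json.dumps({
    'runtimeVersion': '1.10',   # runtime version used to run the batch job
    'maxWorkerCount': '10'      # upper bound on the number of prediction workers
})
```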
## Input data schema

The component accepts the following as input:

*   A trained model: It can be a model file in Cloud Storage, a deployed model, or a version in Cloud ML Engine. Specify the path to the model in the `model_path` runtime argument.
*   Input data: The data used to make predictions against the trained model. The data can be in [multiple formats](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat). The data path is specified by `input_paths` and the format is specified by `input_data_format`.

## Output

Name | Description | Type
:--- | :---------- | :---
job_id | The ID of the created batch job. | String

## Cautions & requirements

To use the component, you must:

*   Set up a cloud environment by following this [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup).
*   Ensure that the component can authenticate to GCP. Refer to [Authenticating Pipelines to GCP](https://www.kubeflow.org/docs/gke/authentication-pipelines/) for details.
*   Grant the following types of access to the Kubeflow user service account:
    *   Read access to the Cloud Storage buckets that contain the input data.
    *   Write access to the Cloud Storage bucket of the output directory.

## Detailed description

Follow these steps to use the component in a pipeline:

1.  Install the Kubeflow Pipeline SDK:

    ```python
    %%capture --no-stderr
    !pip3 install kfp --upgrade
    ```

2.  Load the component using the KFP SDK:

    ```python
    import kfp.components as comp

    mlengine_batch_predict_op = comp.load_component_from_url(
        'https://raw.githubusercontent.com/kubeflow/pipelines/1.4.1/components/gcp/ml_engine/batch_predict/component.yaml')
    help(mlengine_batch_predict_op)
    ```

### Sample Code

Note: The following sample code works in an IPython notebook or directly in Python code.

In this sample, you run a batch prediction job against a pre-built trained model from `gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/`, using the test data from `gs://ml-pipeline-playground/samples/ml_engine/census/test.json`.

#### Inspect the test data

```python
!gsutil cat gs://ml-pipeline-playground/samples/ml_engine/census/test.json
```

#### Set sample parameters

```python
# Required parameters
PROJECT_ID = ''
GCS_WORKING_DIR = 'gs://' # No ending slash
```

```python
# Optional parameters
EXPERIMENT_NAME = 'CLOUDML - Batch Predict'
OUTPUT_GCS_PATH = GCS_WORKING_DIR + '/batch_predict/output/'
```

#### Example pipeline that uses the component

```python
import json

import kfp.dsl as dsl

@dsl.pipeline(
    name='CloudML batch predict pipeline',
    description='CloudML batch predict pipeline'
)
def pipeline(
        project_id=PROJECT_ID,
        model_path='gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/',
        input_paths='["gs://ml-pipeline-playground/samples/ml_engine/census/test.json"]',
        input_data_format='JSON',
        output_path=OUTPUT_GCS_PATH,
        region='us-central1',
        output_data_format='',
        prediction_input=json.dumps({
            'runtimeVersion': '1.10'
        }),
        job_id_prefix='',
        wait_interval='30'):
    mlengine_batch_predict_op(
        project_id=project_id,
        model_path=model_path,
        input_paths=input_paths,
        input_data_format=input_data_format,
        output_path=output_path,
        region=region,
        output_data_format=output_data_format,
        prediction_input=prediction_input,
        job_id_prefix=job_id_prefix,
        wait_interval=wait_interval)
```
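The `job_id` output of the task can also be passed to downstream steps in the same pipeline. The following is a minimal sketch, assuming the component was loaded as `mlengine_batch_predict_op` above; the `print_job_id` consumer is a hypothetical helper built with `func_to_container_op`, not part of the component.

```python
import kfp.dsl as dsl
import kfp.components as comp

# Hypothetical lightweight component that just prints the job ID; any
# component with a job_id input could be wired up the same way.
def print_job_id(job_id: str):
    print(job_id)

print_job_id_op = comp.func_to_container_op(print_job_id)

@dsl.pipeline(
    name='CloudML batch predict with consumer',
    description='Sketch: pass the job_id output to a downstream step'
)
def predict_then_print(project_id=PROJECT_ID, output_path=OUTPUT_GCS_PATH):
    predict_task = mlengine_batch_predict_op(
        project_id=project_id,
        model_path='gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/',
        input_paths='["gs://ml-pipeline-playground/samples/ml_engine/census/test.json"]',
        input_data_format='JSON',
        output_path=output_path,
        region='us-central1')
    # The component's job_id output becomes available to later steps:
    print_job_id_op(job_id=predict_task.outputs['job_id'])
```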
#### Compile the pipeline

```python
pipeline_func = pipeline
pipeline_filename = pipeline_func.__name__ + '.zip'

import kfp.compiler as compiler
compiler.Compiler().compile(pipeline_func, pipeline_filename)
```

#### Submit the pipeline for execution

```python
# Specify pipeline argument values
arguments = {}

# Get or create an experiment and submit a pipeline run
import kfp
client = kfp.Client()
experiment = client.create_experiment(EXPERIMENT_NAME)

# Submit a pipeline run
run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)
```

#### Inspect prediction results

```python
OUTPUT_FILES_PATTERN = OUTPUT_GCS_PATH + '*'
# The $ prefix makes IPython expand the Python variable into the shell command
!gsutil cat $OUTPUT_FILES_PATTERN
```

## References

*   [Component Python code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/ml_engine/_batch_predict.py)
*   [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)
*   [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/batch_predict/sample.ipynb)
*   [Cloud Machine Learning Engine job REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs)

## License

By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control.