Overview
The xgboost-training-cm.py pipeline creates XGBoost models on structured data in CSV format. Both classification and regression are supported.
The pipeline starts by creating a Google Cloud Dataproc cluster, and then runs analysis, transformation, distributed training, and prediction in that cluster. A single-node confusion-matrix aggregator is then used (in the classification case) to provide the confusion matrix data to the front end. Finally, a delete-cluster operation destroys the cluster created at the beginning. The delete-cluster operation is set up as an exit handler, meaning it runs regardless of whether the pipeline succeeds or fails.
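The exit-handler pattern can be sketched with the Kubeflow Pipelines SDK (v1 DSL). The snippet below is a minimal illustration, not the sample's real code: the echo_op steps are hypothetical stand-ins for the actual Dataproc create/train/delete components.

```python
import kfp
from kfp import dsl


def echo_op(name: str, message: str) -> dsl.ContainerOp:
    """Hypothetical stand-in for the real Dataproc component ops."""
    return dsl.ContainerOp(
        name=name,
        image='alpine:3.18',
        command=['sh', '-c', 'echo "$0"', message],
    )


@dsl.pipeline(
    name='exit-handler-sketch',
    description='Illustrates the exit-handler pattern used by xgboost-training-cm.py.',
)
def exit_handler_pipeline():
    # Everything inside the ExitHandler block is guarded by delete_cluster:
    # the exit op runs whether those steps succeed or fail, which is how the
    # sample guarantees the Dataproc cluster is torn down.
    delete_cluster = echo_op('delete-cluster', 'deleting cluster')
    with dsl.ExitHandler(delete_cluster):
        create_cluster = echo_op('create-cluster', 'creating cluster')
        train = echo_op('train', 'training model')
        train.after(create_cluster)


if __name__ == '__main__':
    kfp.compiler.Compiler().compile(exit_handler_pipeline, 'exit_handler_sketch.tar.gz')
```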
Requirements
Preprocessing uses Google Cloud Dataproc. Therefore, you must enable the Dataproc API for the given GCP project.
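If the API is not yet enabled, you can turn it on from the Cloud Console or with the gcloud CLI; for example (replace YOUR_PROJECT_ID with your own project):

```
gcloud services enable dataproc.googleapis.com --project=YOUR_PROJECT_ID
```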
Compile
Follow the guide to building a pipeline to install the Kubeflow Pipelines SDK and compile the sample Python file into a workflow specification. The specification takes the form of a YAML file compressed into a .tar.gz file.
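As a rough sketch, the steps typically look like the following; the exact commands depend on your SDK version, so treat this as illustrative rather than canonical:

```
pip install kfp
dsl-compile --py xgboost-training-cm.py --output xgboost-training-cm.py.tar.gz
```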
Deploy
Open the Kubeflow pipelines UI. Create a new pipeline, and then upload the compiled specification (.tar.gz file) as a new pipeline template.
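If you prefer to script the upload instead of using the UI, the SDK client offers an equivalent call. This sketch assumes a reachable Kubeflow Pipelines endpoint; the host and pipeline name below are placeholders.

```python
import kfp

# Hypothetical endpoint; substitute the address of your KFP deployment.
client = kfp.Client(host='http://localhost:8080')
client.upload_pipeline(
    pipeline_package_path='xgboost-training-cm.py.tar.gz',
    pipeline_name='XGBoost Training with Confusion Matrix',  # placeholder name
)
```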
Run
Most arguments come with default values; only output and project must always be specified.
output is a Google Cloud Storage path that holds the pipeline run results. Note that each pipeline run creates a unique directory under output, so it will not overwrite previous results.
project is a GCP project ID.
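For example, a run might be started with values like the following; both are placeholders rather than real resources:

```
output:  gs://your-bucket/xgboost-sample   # each run writes to a unique subdirectory here
project: your-gcp-project-id
```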
Components source
Create Cluster: source code, container
Analyze (step one for preprocessing): source code, container
Transform (step two for preprocessing): source code, container
Distributed Training: source code, container
Distributed Predictions: source code, container
Confusion Matrix: source code, container
ROC: source code, container
Delete Cluster: source code, container