Overview
The xgboost_training_cm.py pipeline creates XGBoost models on structured data in CSV format. Both classification and regression are supported.
The pipeline starts by creating a Google Dataproc cluster, and then runs analysis, transformation, distributed training, and prediction on that cluster. For the classification case, a single-node confusion-matrix aggregator and a ROC aggregator then provide the confusion matrix data and ROC data to the front end. Finally, a delete-cluster operation destroys the cluster created at the beginning. The delete-cluster operation is wrapped in an exit handler, which means it runs regardless of whether the pipeline succeeds or fails.
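The exit-handler structure described above can be sketched with the KFP v1 DSL roughly as follows. This is a minimal illustration under assumptions, not the actual pipeline: the op names, images, and arguments are placeholders, and the real sample wires in the reusable Dataproc components rather than raw ContainerOps.

```python
# Minimal sketch (assumes the KFP v1 SDK) of the create/delete-cluster
# exit-handler pattern; op names, images, and arguments are placeholders.
from kfp import dsl


@dsl.pipeline(
    name='xgboost-structure-sketch',
    description='Illustrates the exit-handler pattern used by xgboost_training_cm.py.')
def xgb_sketch(project='my-gcp-project', region='us-central1'):
    # The delete-cluster op is registered as the exit handler, so it runs
    # whether the steps inside the handler succeed or fail.
    delete_cluster = dsl.ContainerOp(
        name='delete-cluster',
        image='gcr.io/example/dataproc-delete-cluster:latest',  # placeholder image
        arguments=['--project', project, '--region', region])

    with dsl.ExitHandler(exit_op=delete_cluster):
        create_cluster = dsl.ContainerOp(
            name='create-cluster',
            image='gcr.io/example/dataproc-create-cluster:latest',  # placeholder image
            arguments=['--project', project, '--region', region])

        analyze = dsl.ContainerOp(
            name='analyze',
            image='gcr.io/example/dataproc-analyze:latest',  # placeholder image
            arguments=['--project', project, '--region', region])
        analyze.after(create_cluster)
        # Transform, distributed training, prediction, and the confusion-matrix
        # and ROC aggregators chain after `analyze` in the same way.
```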
Requirements
⚠️ If you are using a full-scope or workload-identity-enabled cluster in the hosted pipelines beta version, DO NOT follow this section. However, you will still need to enable the corresponding GCP API.
Preprocessing uses Google Cloud Dataproc, so you must enable the Cloud Dataproc API for the given GCP project. This is the general guideline for enabling GCP APIs.
If KFP was deployed through the K8s marketplace, please follow the instructions in the guideline to make sure the service account in use has the storage.admin and dataproc.admin roles.
Quota
By default, Dataproc create_cluster creates one master instance of machine type n1-standard-4 together with two worker instances of the same machine type. Since each n1-standard-4 instance has 4 vCPUs, the request consumes 12 vCPUs of quota (3 × 4). The user's GCP project must have this quota available for the sample to work.
⚠️ A free-tier GCP account might not be able to fulfill this quota requirement. To upgrade your account, please follow this link.
Compile
Follow the guide to building a pipeline to install the Kubeflow Pipelines SDK and compile the sample Python file into a workflow specification. The specification takes the form of a YAML file compressed into a .zip file.
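If you prefer to compile programmatically rather than from the command line, a minimal sketch with the KFP v1 SDK looks like this. The pipeline function name is an assumption; check xgboost_training_cm.py for the actual name.

```python
# Minimal sketch: compile the sample into a .zip workflow specification.
# The imported function name is an assumption; verify it in the sample source.
import kfp.compiler as compiler
from xgboost_training_cm import xgb_train_pipeline  # assumed pipeline function name

compiler.Compiler().compile(xgb_train_pipeline, 'xgboost_training_cm.py.zip')
```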
Deploy
Open the Kubeflow Pipelines UI. Create a new pipeline, and then upload the compiled specification (the .zip file) as a new pipeline template.
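You can also upload the compiled package from the SDK instead of the UI. A minimal sketch, assuming a reachable KFP endpoint; the host URL and pipeline name below are placeholders.

```python
# Minimal sketch: upload the compiled package through the KFP client.
# The host URL and pipeline name are placeholders.
import kfp

client = kfp.Client(host='http://localhost:8080')  # placeholder endpoint
client.upload_pipeline(
    pipeline_package_path='xgboost_training_cm.py.zip',
    pipeline_name='xgboost-training-cm')
```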
Run
All arguments come with default values. This pipeline is preloaded as a demo pipeline in the Kubeflow Pipelines UI, so you can run it without any changes.
Modifying the pipeline
For additional exploration you may change some of the parameters or pipeline inputs that are currently specified in the pipeline definition; a sketch for overriding them from the SDK follows the list below.
* `output` is a Google Cloud Storage path which holds the pipeline run results. Note that each pipeline run creates a unique directory under `output`, so previous results are not overwritten.
* `workers` is the number of worker nodes used for this training.
* `rounds` is the number of XGBoost training iterations. Set the value to 200 to get a reasonably trained model.
* `train_data` points to a CSV file that contains the training data. For a sample see 'gs://ml-pipeline-playground/sfpd/train.csv'.
* `eval_data` points to a CSV file that contains the evaluation data. For a sample see 'gs://ml-pipeline-playground/sfpd/eval.csv'.
* `schema` points to a schema file for the train and eval datasets. For a sample see 'gs://ml-pipeline-playground/sfpd/schema.json'.
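These parameters can also be overridden when launching a run from the SDK rather than the UI. A minimal sketch, assuming the compiled package from the Compile step and a reachable KFP endpoint; the host URL, experiment name, bucket path, and override values are placeholders.

```python
# Minimal sketch: launch a run with a few parameters overridden.
# Host URL, experiment name, bucket path, and values are placeholders.
import kfp

client = kfp.Client(host='http://localhost:8080')  # placeholder endpoint
client.create_run_from_pipeline_package(
    pipeline_file='xgboost_training_cm.py.zip',
    arguments={
        'output': 'gs://your-bucket/xgb-demo',  # placeholder GCS path
        'workers': '2',
        'rounds': '200',
    },
    run_name='xgboost-training-cm-sample',
    experiment_name='Default')
```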
Components source
Create Cluster: source code
Analyze (preprocessing step one) and Transform (preprocessing step two) use the PySpark job submission component, with source code
Distributed training and prediction use the Spark job submission component, with source code
Delete Cluster: source code
The container file is located here
For visualization, we use a confusion matrix and ROC curves. Confusion Matrix: source code, container. ROC: source code, container.