The `mini-image-classification-pipeline.py` sample runs a pipeline that demonstrates how to use the create workteam, Ground Truth, and training components.
This sample is based on this example.
The sample goes through the workflow of creating a private workteam, creating data labeling jobs for that team, and running a training job using the new labeled data.
## Prerequisites
Make sure you have completed the setup explained in this README.md. (This pipeline does not use the MNIST dataset; follow the instructions below to get the sample dataset.)
## Prep the dataset, label categories, and UI template
For this demo, you will be using a very small subset of the Google Open Images dataset.
Run the following to download openimgs-annotations.csv:
```bash
wget --no-verbose https://storage.googleapis.com/openimages/2018_04/test/test-annotations-human-imagelabels-boxable.csv -O openimgs-annotations.csv
```
Create an S3 bucket and run this Python script (prep_inputs.py in this directory) to get the images and generate `train.manifest`, `validation.manifest`, `class_labels.json`, and `instructions.template`.
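Once the script has run, you can sanity-check that the four generated files are in place. The sketch below is only an illustrative example using boto3: the bucket name is a placeholder, and it assumes the generated files were uploaded to your S3 bucket.

```python
import boto3

bucket_name = "<your-demo-bucket>"  # placeholder: the bucket you created above

# Collect all object keys in the bucket.
s3 = boto3.client("s3")
keys = [
    obj["Key"]
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket_name)
    for obj in page.get("Contents", [])
]

# Confirm the generated inputs are present somewhere in the bucket.
expected = [
    "train.manifest",
    "validation.manifest",
    "class_labels.json",
    "instructions.template",
]
for name in expected:
    status = "found" if any(key.endswith(name) for key in keys) else "MISSING"
    print(f"{name}: {status}")
```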
## Amazon Cognito user groups
From Amazon Cognito, note down the user pool ID, the user group name, and the app client ID.
You need this information to fill in the pipeline arguments `user_pool`, `user_groups`, and `client_ID`.
Official doc for Amazon Cognito
For this demo you can create a new user pool (if you don't have one already).
1. In the AWS console, go to Amazon SageMaker -> Ground Truth -> Labeling workforces -> Private -> Create private team.
2. Name the team "KFP-ground-truth-demo-pool", use your email address, and create the private team.
3. Select the radio button next to the team and, from the summary, note down the "Amazon Cognito user pool", "App client", and "Labeling portal sign-in URL".
4. Click the team name you created and note down the "Amazon Cognito user group".
Use the information that you noted down to fill in the pipeline arguments:
- `user_pool` = Amazon Cognito user pool
- `user_groups` = Amazon Cognito user group
- `client_ID` = App client
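If you prefer to look these values up programmatically instead of copying them from the console, a minimal boto3 sketch (assuming your AWS credentials and region are already configured; the pool ID below is a placeholder) could look like this:

```python
import boto3

cognito = boto3.client("cognito-idp")

# List user pools and pick the one backing your private workforce.
for pool in cognito.list_user_pools(MaxResults=60)["UserPools"]:
    print("User pool:", pool["Name"], pool["Id"])

pool_id = "<your user pool ID>"  # placeholder, e.g. us-east-1_XXXXXXXXX

# App clients registered with that pool (use the one created for the workforce).
for client in cognito.list_user_pool_clients(UserPoolId=pool_id, MaxResults=60)["UserPoolClients"]:
    print("App client:", client["ClientName"], client["ClientId"])

# User groups in the pool; your private work team maps to one of these.
for group in cognito.list_groups(UserPoolId=pool_id)["Groups"]:
    print("User group:", group["GroupName"])
```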
Note: once you start a run of the pipeline, the Ground Truth labeling jobs will appear at the "Labeling portal sign-in URL" link.
## Compiling the pipeline template
Follow the guide to building a pipeline to install the Kubeflow Pipelines SDK, then run the following command to compile the sample Python into a workflow specification. The specification takes the form of a YAML file compressed into a .tar.gz file.
```bash
dsl-compile --py mini-image-classification-pipeline.py --output mini-image-classification-pipeline.tar.gz
```
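If you prefer to compile from Python rather than the `dsl-compile` CLI, a minimal sketch with the KFP SDK is shown below. The pipeline function name is an assumption; use whatever `@dsl.pipeline`-decorated function the sample actually defines (the file name contains hyphens, so it is loaded via `importlib`):

```python
import importlib.util

import kfp.compiler

# Load the sample module (a normal import won't work because of the hyphens in the file name).
spec = importlib.util.spec_from_file_location(
    "mini_pipeline", "mini-image-classification-pipeline.py"
)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

kfp.compiler.Compiler().compile(
    module.ground_truth_test,  # hypothetical name; match the sample's pipeline function
    "mini-image-classification-pipeline.tar.gz",
)
```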
## Deploying the pipeline
Open the Kubeflow pipelines UI. Create a new pipeline, and then upload the compiled specification (.tar.gz file) as a new pipeline template.
The pipeline requires several arguments: replace `role_arn`, the Amazon Cognito information, and the S3 input paths with your settings, then run the pipeline.
Note: `team_name`, `ground_truth_train_job_name`, and `ground_truth_validation_job_name` must be unique; the pipeline will error out if resources with those names already exist.
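If you launch runs from the KFP SDK instead of the UI, one way to keep those names unique is to append a timestamp when building the argument dictionary. The sketch below is only illustrative: the parameter names and values are assumptions based on the arguments described above, so match them to the parameters actually declared in `mini-image-classification-pipeline.py`:

```python
import time

import kfp

suffix = time.strftime("%Y%m%d-%H%M%S")

# Placeholder values; replace with your role ARN, Cognito details, and S3 paths.
arguments = {
    "role_arn": "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",
    "user_pool": "<Amazon Cognito user pool>",
    "user_groups": "<Amazon Cognito user group>",
    "client_ID": "<App client>",
    "team_name": f"kfp-demo-team-{suffix}",
    "ground_truth_train_job_name": f"kfp-gt-train-{suffix}",
    "ground_truth_validation_job_name": f"kfp-gt-validation-{suffix}",
}

client = kfp.Client()  # assumes a reachable Kubeflow Pipelines endpoint
client.create_run_from_pipeline_package(
    "mini-image-classification-pipeline.tar.gz",
    arguments=arguments,
    run_name=f"mini-image-classification-{suffix}",
)
```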
If you are a new worker, you will receive an email with a link to the labeling portal and login information after the create workteam component completes. During the execution of the two Ground Truth components (one for training data, one for validation data), the labeling jobs will appear in the portal and you will need to complete these jobs.
After the pipeline has finished, you may delete the user pool, the user group, and the S3 bucket.
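A minimal cleanup sketch with boto3 (the bucket and pool identifiers below are placeholders; only run this if you created these resources solely for the demo):

```python
import boto3

bucket_name = "<your-demo-bucket>"    # placeholder
user_pool_id = "<your user pool ID>"  # placeholder, e.g. us-east-1_XXXXXXXXX

# Empty and delete the S3 bucket (a bucket must be empty before it can be deleted).
bucket = boto3.resource("s3").Bucket(bucket_name)
bucket.objects.all().delete()
bucket.delete()

# Deleting the user pool also removes its groups and app clients.
boto3.client("cognito-idp").delete_user_pool(UserPoolId=user_pool_id)
```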
## Components source
Create Workteam: source code
Ground Truth Labeling: source code
Training: source code