# SageMaker Ground Truth Kubeflow Pipelines component
## Summary
Component to submit SageMaker Ground Truth labeling jobs directly from a Kubeflow Pipelines workflow.

# Details

## Intended Use
For Ground Truth jobs using AWS SageMaker.

## Runtime Arguments
Argument        | Description                 | Optional   | Data type  | Accepted values | Default    |
:---            | :----------                 | :----------| :----------| :----------     | :----------|
region | The region where the cluster launches | No | String | | |
endpoint_url | The endpoint URL for the private link VPC endpoint | Yes | String | | |
assume_role | The ARN of an IAM role to assume when connecting to SageMaker | Yes | String | | |
role | The Amazon Resource Name (ARN) that Amazon SageMaker assumes to perform tasks on your behalf | No | String | | |
job_name | The name of the Ground Truth job. Must be unique within the same AWS account and AWS region | Yes | String | | LabelingJob-[datetime]-[random id]|
label_attribute_name | The attribute name to use for the label in the output manifest file | Yes | String | | job_name |
manifest_location | The Amazon S3 location of the manifest file that describes the input data objects | No | String | | |
output_location | The Amazon S3 path where you want Amazon SageMaker to store the results of the transform job | No | String | | |
output_encryption_key | The AWS KMS key that Amazon SageMaker uses to encrypt the model artifacts | Yes | String | | |
task_type | Built in image classification, bounding box, text classification, or semantic segmentation, or custom; If custom, please provide pre- and post-labeling task lambda functions | No | String | Image Classification, Bounding Box, Text Classification, Semantic Segmentation, Custom | |
worker_type | The workteam for data labeling | No | String | Public, Private, Vendor | |
workteam_arn | The ARN of the work team assigned to complete the tasks; specify if worker type is private or vendor | Yes | String | | |
no_adult_content | If data is free of adult content; specify if worker type is public | Yes | Boolean | False, True | False |
no_ppi | If data is free of personally identifiable information; specify if worker type is public | Yes | Boolean | False, True | False |
label_category_config | The S3 URL of the JSON structured file that defines the categories used to label the data objects | Yes | String | | |
max_human_labeled_objects | The maximum number of objects that can be labeled by human workers | Yes | Int | ≥ 1 | all objects |
max_percent_objects | The maximum percentage of input data objects that should be labeled | Yes | Int | [1, 100] | 100 |
enable_auto_labeling | Enables auto-labeling; only for bounding box, text classification, and image classification | Yes | Boolean | False, True | False |
initial_model_arn | The ARN of the final model used for a previous auto-labeling job | Yes | String | | |
resource_encryption_key | The AWS KMS key that Amazon SageMaker uses to encrypt data on the storage volume attached to the ML compute instance(s) | Yes | String | | |
ui_template | The Amazon S3 bucket location of the UI template | No | String | | |
pre_human_task_function | The ARN of a Lambda function that is run before a data object is sent to a human worker | Yes | String | | |
post_human_task_function | The ARN of a Lambda function implements the logic for annotation consolidation | Yes | String | | |
task_keywords | Keywords used to describe the task so that workers on Amazon Mechanical Turk can discover the task | Yes | String | | |
title | A title for the task for your human workers | No | String | | |
description | A description of the task for your human workers | No | String | | |
num_workers_per_object | The number of human workers that will label an object | No | Int | [1, 9] | |
time_limit | The maximum run time in seconds per training job | No | Int | [30, 28800] | |
task_availibility | The length of time that a task remains available for labeling by human workers | Yes | Int | Public workforce: [1, 43200], other: [1, 864000] | |
max_concurrent_tasks | The maximum number of data objects that can be labeled by human workers at the same time | Yes | Int | [1, 1000] | |
workforce_task_price | The price that you pay for each task performed by a public worker in USD; Specify to the tenth fractions of a cent; Format as "0.000" | Yes | Float | 0.000 |
tags | Key-value pairs to categorize AWS resources | Yes | Dict | | {} |

## Outputs
Name | Description
:--- | :----------
output_manifest_location | URL where labeling results were stored
active_learning_model_arn | ARN of the resulting active learning model

# Requirements
* [Kubeflow pipelines SDK](https://www.kubeflow.org/docs/pipelines/sdk/install-sdk/)
* [Kubeflow set-up on AWS](https://www.kubeflow.org/docs/aws/deploy/install-kubeflow/)

# Samples
## Used in a pipeline with workteam creation and training
Mini image classification demo: [Demo](https://github.com/kubeflow/pipelines/blob/master/samples/contrib/aws-samples/ground_truth_pipeline_demo/)

# References
* [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html)
* [Building a custom data labeling workflow](https://aws.amazon.com/blogs/machine-learning/build-a-custom-data-labeling-workflow-with-amazon-sagemaker-ground-truth/)
* [Sample UI template for Bounding Box](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/ground_truth_labeling_jobs/ground_truth_object_detection_tutorial/object_detection_tutorial.ipynb)
* [Sample UI template for Image Classification](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/ground_truth_labeling_jobs/from_unlabeled_data_to_deployed_machine_learning_model_ground_truth_demo_image_classification)
* [Using Ground Truth results in training jobs](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/ground_truth_labeling_jobs/object_detection_augmented_manifest_training/object_detection_augmented_manifest_training.ipynb)