# SageMaker Batch Transform Kubeflow Pipeline component

## Summary

Component to get inferences for an entire dataset in SageMaker from a Kubeflow Pipelines workflow.

## Details

With batch transform, you create a batch transform job using a trained model and the dataset, which must be stored in Amazon S3. Use batch transform when you:

* Want to get inferences for an entire dataset and index them to serve inferences in real time
* Don't need a persistent endpoint that applications (for example, web or mobile apps) can call to get inferences
* Don't need the subsecond latency that Amazon SageMaker hosted endpoints provide

## Intended Use

Create a transform job in AWS SageMaker.
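Although the sample pipelines linked below are the reference usage, a minimal sketch of wiring this component into a Kubeflow pipeline with the KFP v1 SDK looks roughly like the following. The model name, S3 paths, and the `aws-secret` Kubernetes secret are placeholders and assume your cluster already has AWS credentials configured:

```python
import kfp
from kfp import components, dsl
from kfp.aws import use_aws_secret

# Load the batch transform component from this directory's component.yaml
# (adjust the path to wherever the file lives relative to your pipeline script).
sagemaker_batch_transform_op = components.load_component_from_file("component.yaml")

@dsl.pipeline(
    name="SageMaker batch transform",
    description="Runs a batch transform job against an existing SageMaker model",
)
def batch_transform_pipeline(
    region="us-east-1",
    model_name="my-existing-sagemaker-model",      # hypothetical model name
    input_location="s3://my-bucket/batch/input",   # hypothetical S3 locations
    output_location="s3://my-bucket/batch/output",
):
    sagemaker_batch_transform_op(
        region=region,
        model_name=model_name,
        input_location=input_location,
        output_location=output_location,
        instance_type="ml.m4.xlarge",
        instance_count=1,
    ).apply(use_aws_secret("aws-secret", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"))

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(batch_transform_pipeline, "batch_transform_pipeline.yaml")
```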

## Runtime Arguments

| Argument | Description | Optional (in pipeline definition) | Optional (in UI) | Data type | Accepted values | Default |
|:---------|:------------|:----------------------------------|:-----------------|:----------|:----------------|:--------|
| region | The region where the endpoint is created | No | No | String | | |
| endpoint_url | The endpoint URL for the private link VPC endpoint | Yes | | String | | |
| assume_role | The ARN of an IAM role to assume when connecting to SageMaker | Yes | | String | | |
| job_name | The name of the transform job. The name must be unique within an AWS Region in an AWS account | Yes | Yes | String | | is a generated name (combination of model_name and 'BatchTransform' string) |
| model_name | The name of the model that you want to use for the transform job. Model name must be the name of an existing Amazon SageMaker model within an AWS Region in an AWS account | No | No | String | | |
| max_concurrent | The maximum number of parallel requests that can be sent to each instance in a transform job | Yes | Yes | Integer | | 0 |
| max_payload | The maximum allowed size of the payload, in MB | Yes | Yes | Integer | The value in max_payload must be greater than, or equal to, the size of a single record | 6 |
| batch_strategy | The number of records to include in a mini-batch for an HTTP inference request | Yes | Yes | String | | |
| environment | The environment variables to set in the Docker container | Yes | Yes | Dict | Maximum length of 1024. Key pattern: `[a-zA-Z_][a-zA-Z0-9_]*`; value pattern: `[\S\s]*`. Up to 16 key-value entries in the map | |
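For example, the environment argument is a plain map of string keys and values; a hypothetical set of container variables (names and values here are placeholders) might look like this:

```python
# Hypothetical environment variables for the model container; keys must match
# [a-zA-Z_][a-zA-Z0-9_]* and the map can hold at most 16 entries.
environment = {
    "LOG_LEVEL": "INFO",
    "MODEL_SERVER_TIMEOUT": "120",
}
```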

The following parameters are used to construct the `TransformInput` object of the CreateTransformJob API. They describe the input source and the way the transform job consumes it.

| Argument | Description | Optional (in pipeline definition) | Optional (in UI) | Data type | Accepted values | Default |
|:---------|:------------|:----------------------------------|:-----------------|:----------|:----------------|:--------|
| input_location | The S3 location of the data source that is associated with a channel. Read more on S3Uri | No | No | String | | |
| data_type | Used by SageMaker to identify the objects from the S3 bucket to be used for batch transform. Read more on S3DataType | Yes | Yes | String | ManifestFile, S3Prefix, AugmentedManifestFile | S3Prefix |
| content_type | The multipurpose internet mail extension (MIME) type of the data. Amazon SageMaker uses the MIME type with each HTTP call to transfer data to the transform job | Yes | Yes | String | | |
| split_type | The method to use to split the transform job data files into smaller batches | Yes | Yes | String | Line, RecordIO, TFRecord, None | None |
| compression_type | If the transform data is compressed, specify the compression type | Yes | Yes | String | GZip, None | None |
```python
TransformInput={
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'ManifestFile'|'S3Prefix'|'AugmentedManifestFile',
            'S3Uri': 'string'
        }
    },
    ... other input parameters ...
}
```

Ref

The following parameters are used to construct the `TransformOutput` object of the CreateTransformJob API. These describe the results of a transform job.

| Argument | Description | Optional (in pipeline definition) | Optional (in UI) | Data type | Accepted values | Default |
|:---------|:------------|:----------------------------------|:-----------------|:----------|:----------------|:--------|
| output_location | The Amazon S3 path where you want Amazon SageMaker to store the results of the transform job | No | No | String | | |
| accept | The MIME type used to specify the output data. Amazon SageMaker uses the MIME type with each HTTP call to transfer data from the transform job | Yes | Yes | String | | |
| assemble_with | Defines how to assemble the results of the transform job as a single S3 object. To concatenate the results in binary format, specify None. To add a newline character at the end of every transformed record, specify Line | Yes | Yes | String | Line, None | None |
| output_encryption_key | The AWS Key Management Service key to encrypt the model artifacts at rest using Amazon S3 server-side encryption | Yes | Yes | String | KmsKeyId formats | |
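Mirroring the TransformInput sketch above, these arguments populate a TransformOutput structure along these lines (values are illustrative placeholders):

```python
TransformOutput = {
    'S3OutputPath': 's3://my-bucket/batch/output',  # output_location
    'Accept': 'text/csv',                           # accept
    'AssembleWith': 'Line',                         # assemble_with
    'KmsKeyId': 'my-kms-key-id',                    # output_encryption_key
}
```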

The following parameters are used to construct the `TransformResources` object of the CreateTransformJob API. These describe the resources, including ML instance types and ML instance count, to use for the transform job.

| Argument | Description | Optional (in pipeline definition) | Optional (in UI) | Data type | Accepted values | Default |
|:---------|:------------|:----------------------------------|:-----------------|:----------|:----------------|:--------|
| instance_type | The ML compute instance type for the transform job | Yes | Yes | String | ml.m4.xlarge, ml.m4.2xlarge, ml.m4.4xlarge, ml.m4.10xlarge, ml.m4.16xlarge, ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ml.m5.12xlarge, ml.m5.24xlarge, ml.c4.xlarge, ml.c4.2xlarge, ml.c4.4xlarge, ml.c4.8xlarge, ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, ml.p3.16xlarge, ml.c5.xlarge, ml.c5.2xlarge, ml.c5.4xlarge, ml.c5.9xlarge, ml.c5.18xlarge | ml.m4.xlarge |
| instance_count | The number of ML compute instances to use in the transform job | Yes | Yes | Integer | | 1 |
| resource_encryption_key | The AWS Key Management Service (AWS KMS) key used to encrypt model data on the storage volume attached to the ML compute instance(s) that run the batch transform job | Yes | Yes | String | VolumeKmsKeyId formats | |
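These arguments map onto a TransformResources structure roughly as follows (values are illustrative):

```python
TransformResources = {
    'InstanceType': 'ml.m4.xlarge',     # instance_type
    'InstanceCount': 1,                 # instance_count
    'VolumeKmsKeyId': 'my-kms-key-id',  # resource_encryption_key
}
```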

The following parameters are used to construct the `DataProcessing` object of the CreateTransformJob API. These specify the data to be used for inference in a batch transform job and associate the data that is relevant to the prediction results in the output.

| Argument | Description | Optional (in pipeline definition) | Optional (in UI) | Data type | Accepted values | Default |
|:---------|:------------|:----------------------------------|:-----------------|:----------|:----------------|:--------|
| input_filter | A JSONPath expression used to select a portion of the input data to pass to the algorithm. Read more on InputFilter | Yes | Yes | String | | |
| output_filter | A JSONPath expression used to select a portion of the joined dataset to save in the output file for a batch transform job. Read more on OutputFilter | Yes | Yes | String | | |
| join_source | Specifies the source of the data to join with the transformed data. Read more on JoinSource | Yes | Yes | String | Input, None | None |
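As an illustration, passing only a hypothetical features field to the model while joining the predictions back onto the original input corresponds to a DataProcessing structure like this (the JSONPath expressions are placeholders):

```python
DataProcessing = {
    'InputFilter': '$.features',  # input_filter: send only the 'features' field to the model
    'OutputFilter': '$',          # output_filter: keep the full joined record in the output
    'JoinSource': 'Input',        # join_source: join the predictions with the input data
}
```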

Notes:

* Please use the links in the Resources section for detailed information on each input parameter and the SageMaker APIs used in this component

## Outputs

| Name | Description |
|:-----|:------------|
| output_location | The Amazon S3 path where you want Amazon SageMaker to store the results of the transform job |
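A hedged sketch of consuming this output in a downstream step, reusing the `sagemaker_batch_transform_op` factory loaded in the earlier example; the `aws-cli` image and the listing command stand in for whatever post-processing you actually run:

```python
@dsl.pipeline(name="batch-transform-then-inspect")
def transform_and_inspect(region="us-east-1"):
    transform = sagemaker_batch_transform_op(
        region=region,
        model_name="my-existing-sagemaker-model",     # hypothetical
        input_location="s3://my-bucket/batch/input",  # hypothetical
        output_location="s3://my-bucket/batch/output",
    )
    # 'output_location' is the component output that holds the S3 results prefix.
    dsl.ContainerOp(
        name="list-results",
        image="amazon/aws-cli",
        command=["aws", "s3", "ls", transform.outputs["output_location"]],
    )
```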

## Requirements

## Samples

### Integrated into a pipeline

MNIST Classification pipeline: Pipeline | Steps

## Resources