
# SageMaker Batch Transform Kubeflow Pipeline component

## Summary

Component to get inferences for an entire dataset in SageMaker from a Kubeflow Pipelines workflow.

## Details

With batch transform, you create a batch transform job using a trained model and the dataset, which must be stored in Amazon S3. Use batch transform when you:

- Want to get inferences for an entire dataset and index them to serve inferences in real time
- Don't need a persistent endpoint that applications (for example, web or mobile apps) can call to get inferences
- Don't need the subsecond latency that Amazon SageMaker hosted endpoints provide

## Intended Use

Create a transform job in AWS SageMaker.
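
A minimal usage sketch with the Kubeflow Pipelines v1 SDK is shown below. The S3 paths, model name, and Kubernetes secret name are placeholders, and the component URL assumes this directory's `component.yaml` on the master branch of kubeflow/pipelines; pin it to the release you actually use.

```python
import kfp
from kfp import components
from kfp.aws import use_aws_secret

# Load the component definition; adjust the ref/tag to match your deployment.
batch_transform_op = components.load_component_from_url(
    "https://raw.githubusercontent.com/kubeflow/pipelines/master/components/aws/sagemaker/batch_transform/component.yaml"
)

@kfp.dsl.pipeline(name="sagemaker-batch-transform-example")
def transform_pipeline(
    region: str = "us-east-1",
    model_name: str = "my-existing-sagemaker-model",    # assumed: model already exists in SageMaker
    input_location: str = "s3://my-bucket/batch/input",  # placeholder S3 prefix
    output_location: str = "s3://my-bucket/batch/output",
):
    batch_transform_op(
        region=region,
        model_name=model_name,
        input_location=input_location,
        output_location=output_location,
        instance_type="ml.m4.xlarge",
        instance_count=1,
    ).apply(use_aws_secret("aws-secret", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"))
```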

## Runtime Arguments

| Argument | Description | Optional (in pipeline definition) | Optional (in UI) | Data type | Accepted values | Default |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| region | The region where the endpoint is created | No | No | String | | |
| endpoint_url | The endpoint URL for the private link VPC endpoint | Yes | | String | | |
| assume_role | The ARN of an IAM role to assume when connecting to SageMaker | Yes | | String | | |
| job_name | The name of the transform job. The name must be unique within an AWS Region in an AWS account | Yes | Yes | String | | Generated name (combination of model_name and the string 'BatchTransform') |
| model_name | The name of the model that you want to use for the transform job. Model name must be the name of an existing Amazon SageMaker model within an AWS Region in an AWS account | No | No | String | | |
| max_concurrent | The maximum number of parallel requests that can be sent to each instance in a transform job | Yes | Yes | Integer | | 0 |
| max_payload | The maximum allowed size of the payload, in MB | Yes | Yes | Integer | The value in max_payload must be greater than, or equal to, the size of a single record | 6 |
| batch_strategy | The number of records to include in a mini-batch for an HTTP inference request | Yes | Yes | String | | |
| environment | The environment variables to set in the Docker container | Yes | Yes | Dict | Maximum length of 1024. Key pattern: `[a-zA-Z_][a-zA-Z0-9_]*`. Value pattern: `[\S\s]*`. Up to 16 key-value entries in the map | |
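
As an illustration, a valid `environment` value is a small string-to-string map; the variable names below are invented for the example.

```python
# Hypothetical container environment variables for the `environment` argument.
# Keys must match [a-zA-Z_][a-zA-Z0-9_]* and the map may contain at most 16 entries.
environment = {
    "MODEL_SERVER_TIMEOUT": "3600",
    "LOG_LEVEL": "INFO",
}
```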

The following parameters are used to construct the TransformInput object of the CreateTransformJob API. They describe the input source and the way the transform job consumes it.

| Argument | Description | Optional (in pipeline definition) | Optional (in UI) | Data type | Accepted values | Default |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| input_location | The S3 location of the data source that is associated with a channel. Read more on S3Uri | No | No | String | | |
| data_type | Used by SageMaker to identify the objects from the S3 bucket to be used for batch transform. Read more on S3DataType | Yes | Yes | String | ManifestFile, S3Prefix, AugmentedManifestFile | S3Prefix |
| content_type | The multipurpose internet mail extension (MIME) type of the data. Amazon SageMaker uses the MIME type with each HTTP call to transfer data to the transform job | Yes | Yes | String | | |
| split_type | The method to use to split the transform job data files into smaller batches | Yes | Yes | String | Line, RecordIO, TFRecord, None | None |
| compression_type | If the transform data is compressed, specify the compression type | Yes | Yes | String | GZip, None | None |
```python
TransformInput={
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'ManifestFile'|'S3Prefix'|'AugmentedManifestFile',
            'S3Uri': 'string'
        }
    },
    ... other input parameters ...
}
```

See the CreateTransformJob API reference for the full request schema.

The following parameters are used to construct the TransformOutput object of the CreateTransformJob API. They describe the results of a transform job.

| Argument | Description | Optional (in pipeline definition) | Optional (in UI) | Data type | Accepted values | Default |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| output_location | The Amazon S3 path where you want Amazon SageMaker to store the results of the transform job | No | No | String | | |
| accept | The MIME type used to specify the output data. Amazon SageMaker uses the MIME type with each HTTP call to transfer data from the transform job | Yes | Yes | String | | |
| assemble_with | Defines how to assemble the results of the transform job as a single S3 object. To concatenate the results in binary format, specify None. To add a newline character at the end of every transformed record, specify Line | Yes | Yes | String | Line, None | None |
| output_encryption_key | The AWS Key Management Service key to encrypt the model artifacts at rest using Amazon S3 server-side encryption | Yes | Yes | String | KmsKeyId formats | |
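
For reference, the corresponding TransformOutput structure in the CreateTransformJob request has roughly this shape (field names are from the AWS API; the mapping comments show which component argument feeds each field):

```python
TransformOutput={
    'S3OutputPath': 'string',       # output_location
    'Accept': 'string',             # accept
    'AssembleWith': 'Line'|'None',  # assemble_with
    'KmsKeyId': 'string'            # output_encryption_key
}
```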

The following parameters are used to construct the TransformResources object of the CreateTransformJob API. They describe the resources, including ML instance types and ML instance count, to use for the transform job.

| Argument | Description | Optional (in pipeline definition) | Optional (in UI) | Data type | Accepted values | Default |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| instance_type | The ML compute instance type for the transform job | Yes | Yes | String | ml.m4.xlarge, ml.m4.2xlarge, ml.m4.4xlarge, ml.m4.10xlarge, ml.m4.16xlarge, ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ml.m5.12xlarge, ml.m5.24xlarge, ml.c4.xlarge, ml.c4.2xlarge, ml.c4.4xlarge, ml.c4.8xlarge, ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, ml.p3.16xlarge, ml.c5.xlarge, ml.c5.2xlarge, ml.c5.4xlarge, ml.c5.9xlarge, ml.c5.18xlarge, ml.g4dn.xlarge, ml.g4dn.2xlarge, ml.g4dn.4xlarge, ml.g4dn.8xlarge, ml.g4dn.12xlarge, ml.g4dn.16xlarge | ml.m4.xlarge |
| instance_count | The number of ML compute instances to use in the transform job | Yes | Yes | Integer | | 1 |
| resource_encryption_key | The AWS Key Management Service (AWS KMS) key used to encrypt model data on the storage volume attached to the ML compute instance(s) that run the batch transform job | Yes | Yes | String | VolumeKmsKeyId formats | |
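
These arguments map onto the TransformResources structure of the request (field names are from the AWS API; comments show the component argument behind each field):

```python
TransformResources={
    'InstanceType': 'ml.m4.xlarge',  # instance_type
    'InstanceCount': 1,              # instance_count
    'VolumeKmsKeyId': 'string'       # resource_encryption_key
}
```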

The following parameters are used to construct the DataProcessing object of the CreateTransformJob API. This data structure specifies the data to be used for inference in a batch transform job and how to associate the data that is relevant to the prediction results in the output.

| Argument | Description | Optional (in pipeline definition) | Optional (in UI) | Data type | Accepted values | Default |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| input_filter | A JSONPath expression used to select a portion of the input data to pass to the algorithm. Read more on InputFilter | Yes | Yes | String | | |
| output_filter | A JSONPath expression used to select a portion of the joined dataset to save in the output file for a batch transform job. Read more on OutputFilter | Yes | Yes | String | | |
| join_source | Specifies the source of the data to join with the transformed data. Read more on JoinSource | Yes | Yes | String | Input, None | None |
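
These arguments map onto the DataProcessing structure of the request (field names are from the AWS API; the JSONPath values in the comments are only illustrative):

```python
DataProcessing={
    'InputFilter': 'string',      # input_filter, e.g. '$[1:]' to drop the first column
    'OutputFilter': 'string',     # output_filter, e.g. '$[0,-1]'
    'JoinSource': 'Input'|'None'  # join_source
}
```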

## Notes

- Please use the links in the Resources section for detailed information on each input parameter and the SageMaker APIs used in this component.

## Outputs

| Name | Description |
| :--- | :--- |
| output_location | The Amazon S3 path where you want Amazon SageMaker to store the results of the transform job |
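
Downstream steps can consume this output in the usual KFP v1 way. A sketch, assuming the `batch_transform_op` task from the earlier example and a hypothetical `process_results_op` component:

```python
transform_task = batch_transform_op(
    region=region,
    model_name=model_name,
    input_location=input_location,
    output_location=output_location,
)

# `output_location` is exposed as a task output and can be wired into later steps.
process_results_op(results_s3_path=transform_task.outputs["output_location"])
```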

## Requirements

## Samples

### Integrated into a pipeline

MNIST Classification pipeline: Pipeline | Steps

## Resources