# SageMaker Processing Kubeflow Pipelines component

## Summary

Component to submit SageMaker Processing jobs directly from a Kubeflow Pipelines workflow.

https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html

## Intended Use

For running data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation, using Amazon SageMaker.

## Runtime Arguments

Argument | Description | Optional | Data type | Accepted values | Default |
:--- | :---------- | :----------| :----------| :---------- | :----------|
region | The region where the cluster launches | No | String | | |
endpoint_url | The endpoint URL for the private link VPC endpoint | Yes | String | | |
assume_role | The ARN of an IAM role to assume when connecting to SageMaker | Yes | String | | |
job_name | The name of the Processing job. Must be unique within the same AWS account and AWS region | Yes | String | | ProcessingJob-[datetime]-[random id] |
role | The Amazon Resource Name (ARN) of the IAM role that Amazon SageMaker assumes to perform tasks on your behalf | No | String | | |
image | The registry path of the Docker image that contains the processing script | Yes | String | | |
instance_type | The ML compute instance type | Yes | String | ml.m4.xlarge, ml.m4.2xlarge, ml.m4.4xlarge, ml.m4.10xlarge, ml.m4.16xlarge, ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ml.m5.12xlarge, ml.m5.24xlarge, ml.c4.xlarge, ml.c4.2xlarge, ml.c4.4xlarge, ml.c4.8xlarge, ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, ml.p3.16xlarge, ml.c5.xlarge, ml.c5.2xlarge, ml.c5.4xlarge, ml.c5.9xlarge, ml.c5.18xlarge, [and many more](https://aws.amazon.com/sagemaker/pricing/instance-types/) | ml.m4.xlarge |
instance_count | The number of ML compute instances to use in each processing job | Yes | Int | ≥ 1 | 1 |
volume_size | The size of the ML storage volume that you want to provision, in GB | Yes | Int | ≥ 1 | 30 |
resource_encryption_key | The AWS KMS key that Amazon SageMaker uses to encrypt data on the storage volume attached to the ML compute instance(s) | Yes | String | | |
output_encryption_key | The AWS KMS key that Amazon SageMaker uses to encrypt the processing job output | Yes | String | | |
max_run_time | The maximum run time in seconds per processing job | Yes | Int | ≤ 432000 (5 days) | 86400 (1 day) |
environment | The environment variables to set in the Docker container | Yes | Dict | Maximum length of 1024 for each key and value. Key pattern: `[a-zA-Z_][a-zA-Z0-9_]*`. Value pattern: `[\S\s]*`. Up to 16 key-value entries in the map | |
container_entrypoint | The entrypoint for the processing job, given as a list of strings that make up a command | Yes | List of Strings | | [] |
container_arguments | A list of string arguments to be passed to a processing job | Yes | List of Strings | | [] |
input_config | Parameters that specify Amazon S3 inputs for a processing job | No | List of Dicts | | [] |
output_config | Parameters that specify Amazon S3 outputs for a processing job | No | List of Dicts | | [] |
vpc_security_group_ids | A comma-delimited list of security group IDs, in the form sg-xxxxxxxx | Yes | String | | |
vpc_subnets | A comma-delimited list of subnet IDs in the VPC to which you want to connect your processing job | Yes | String | | |
network_isolation | Isolates the processing container if true | No | Boolean | False, True | True |
traffic_encryption | Encrypts all communications between ML compute instances in distributed processing if true | No | Boolean | False, True | False |
tags | Key-value pairs to categorize AWS resources | Yes | Dict | | {} |

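
The limits listed in the table can be checked before a pipeline run is submitted. The following is a minimal sketch, not part of the component: `validate_processing_args` is a hypothetical helper, and it only covers the numeric bounds and `environment` constraints stated in the table. It does not check instance types or any limits the table does not mention.

```python
import re

# Limits copied from the Runtime Arguments table.
MAX_RUN_TIME_SECONDS = 432000  # 5 days
MAX_ENV_ENTRIES = 16
MAX_ENV_LENGTH = 1024
ENV_KEY_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def validate_processing_args(instance_count=1, volume_size=30,
                             max_run_time=86400, environment=None):
    """Hypothetical helper: return a list of problems (empty means OK)."""
    problems = []
    if instance_count < 1:
        problems.append("instance_count must be >= 1")
    if volume_size < 1:
        problems.append("volume_size must be >= 1")
    if max_run_time > MAX_RUN_TIME_SECONDS:
        problems.append("max_run_time must be <= 432000 seconds")

    environment = environment or {}
    if len(environment) > MAX_ENV_ENTRIES:
        problems.append("environment supports at most 16 entries")
    for key, value in environment.items():
        # The value pattern [\S\s]* matches any string, so only keys are checked.
        if not ENV_KEY_PATTERN.fullmatch(key):
            problems.append(f"invalid environment key: {key!r}")
        if len(key) > MAX_ENV_LENGTH or len(value) > MAX_ENV_LENGTH:
            problems.append(f"environment entry too long: {key!r}")
    return problems


# Defaults from the table pass; a key starting with a digit does not.
print(validate_processing_args(environment={"MODE": "train"}))
print(validate_processing_args(environment={"1bad": "x"}))
```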

Notes:

* You can find more information about how the container entrypoint and arguments are used in the [Build Your Own Processing Container](https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html#byoc-run-image) documentation.
* Each key and value in the `environment` string-to-string map can be up to 1024 characters long. SageMaker supports up to 16 entries in the map.
* The format for the [`input_config`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingInput.html) field is:
```
[
    {
        'InputName': 'string',
        'S3Input': {
            'S3Uri': 'string',
            'LocalPath': 'string',
            'S3DataType': 'ManifestFile'|'S3Prefix',
            'S3InputMode': 'Pipe'|'File',
            'S3DataDistributionType': 'FullyReplicated'|'ShardedByS3Key',
            'S3CompressionType': 'None'|'Gzip'
        }
    },
]
```
* The format for the [`output_config`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingS3Output.html) field is:
```
[
    {
        'OutputName': 'string',
        'S3Output': {
            'S3Uri': 'string',
            'LocalPath': 'string',
            'S3UploadMode': 'Continuous'|'EndOfJob'
        }
    },
]
```
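
The two formats above can be assembled in Python before being passed to the component. This is only a sketch under stated assumptions: `make_input` and `make_output` are hypothetical helpers, the bucket name and local paths are placeholders, and the optional `S3DataDistributionType` and `S3CompressionType` keys are left out for brevity. Whether the component requires every key is not stated here.

```python
import json

# Allowed values copied from the input_config / output_config formats.
S3_DATA_TYPES = {"ManifestFile", "S3Prefix"}
S3_INPUT_MODES = {"Pipe", "File"}
S3_UPLOAD_MODES = {"Continuous", "EndOfJob"}


def make_input(name, s3_uri, local_path, data_type="S3Prefix", input_mode="File"):
    """Hypothetical helper: build one input_config entry."""
    if data_type not in S3_DATA_TYPES:
        raise ValueError(f"unsupported S3DataType: {data_type}")
    if input_mode not in S3_INPUT_MODES:
        raise ValueError(f"unsupported S3InputMode: {input_mode}")
    return {
        "InputName": name,
        "S3Input": {
            "S3Uri": s3_uri,
            "LocalPath": local_path,
            "S3DataType": data_type,
            "S3InputMode": input_mode,
        },
    }


def make_output(name, s3_uri, local_path, upload_mode="EndOfJob"):
    """Hypothetical helper: build one output_config entry."""
    if upload_mode not in S3_UPLOAD_MODES:
        raise ValueError(f"unsupported S3UploadMode: {upload_mode}")
    return {
        "OutputName": name,
        "S3Output": {
            "S3Uri": s3_uri,
            "LocalPath": local_path,
            "S3UploadMode": upload_mode,
        },
    }


# Assumption: bucket name and container paths are placeholders.
input_config = [make_input("raw", "s3://example-bucket/raw/", "/opt/ml/processing/input")]
output_config = [make_output("features", "s3://example-bucket/features/", "/opt/ml/processing/output")]

print(json.dumps(input_config, indent=2))
print(json.dumps(output_config, indent=2))
```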

## Outputs

Name | Description
:--- | :----------
job_name | Processing job name
output_artifacts | A dictionary that maps each `OutputName` in `output_config` to its `S3Uri`

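
To make the shape of `output_artifacts` concrete, the sketch below derives such a mapping from an `output_config` list. It illustrates the documented key and value only. `output_artifacts_from_config` is a hypothetical helper, not the component's actual implementation, and the bucket name is a placeholder.

```python
def output_artifacts_from_config(output_config):
    """Hypothetical helper: map each OutputName to its S3Uri."""
    return {
        entry["OutputName"]: entry["S3Output"]["S3Uri"]
        for entry in output_config
    }


# Assumption: placeholder bucket and paths, following the output_config format.
example_output_config = [
    {
        "OutputName": "features",
        "S3Output": {
            "S3Uri": "s3://example-bucket/features/",
            "LocalPath": "/opt/ml/processing/output",
            "S3UploadMode": "EndOfJob",
        },
    },
]

artifacts = output_artifacts_from_config(example_output_config)
print(artifacts["features"])
```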

## Requirements

* [Kubeflow Pipelines SDK](https://www.kubeflow.org/docs/pipelines/sdk/install-sdk/)
* [Kubeflow set-up](https://www.kubeflow.org/docs/aws/deploy/install-kubeflow/)

## Resources

* [Create Processing Job API documentation](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html)
* [Boto3 API reference](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_processing_job)