Bringing Your Own Algorithm to SageMaker

Introduction and How Built-in Algorithms Work (Lesson 235)

Built-in Algorithms

Optimized for the AWS Cloud and straightforward to scale.
Includes popular choices like XGBoost, DeepAR, PCA, FM.

Pre-built Container Images

Compatible with MxNet, TensorFlow, scikit-learn, PyTorch.
A wide selection of algorithms and the ability to develop new ones.

Extend Pre-built Container Images

Tailor pre-built images to specific needs.

Custom Container Images

Ideal for proprietary algorithms or when using frameworks not supported by SageMaker.

Training & Hosting with SageMaker (Lesson 236)

Built-in Algorithms – Training

Data Storage: Training and test data are securely stored in S3.
Configuration: Includes the algorithm image, hyperparameters, and details about the instance type and count.
Training Process: SageMaker manages the instance setup, data retrieval from ECR and S3, and stores the model back to S3.

Custom Image – Training and Hosting

Container Requirements: Must meet SageMaker specifications and stored in ECR.
Entry Points: Containers need to define training and serving entry points.
Model Serialization: Post-training, model artifacts are serialized to a directory for SageMaker to upload to S3.

Built-in Algorithms – Hosting (Realtime, Batch)

Model Deployment: Utilizes the same Docker image from training, with the option to select specific models for deployment.
Hosting Configuration: Involves specifying the model's S3 location and configuring the instance details.

Custom and Framework Images – Hosting

Framework Adaptation: Use SageMaker's containers for popular frameworks for easier adaptation.
Local Hosting: Option to host the model on a SageMaker notebook instance for development and testing.

Container Folder Structure (Lesson 237)

SageMaker Training and Hosting Requirements

Folder Structure: A standard structure is used for data, code, model, and output.
Instrumentation and Logs: Use standard output/error and CloudWatch for logging.
Metric Capture: Metrics are logged and captured using regex patterns.
Image Strategy: Single image for training and hosting, or separate images if needed.

Training and Hosting Folder Structure

/opt/ml/ contains subfolders:
input/ for configuration and data.
- config/ for hyperparameters.
- hyperparameters.json for hyperparameters.
- resourceconfig.json for instance type and count.
- data/ for training and test data.
- channels/ for multiple data channels.
code/ for scripts.
model/ for trained models.
output/ for failure captures.
- failure/ for failure logs.

Lab (Lecture 238 & 239)

Scikit-Learn Training and Serving Example
- I had an error about numexpr package. I fixed it with below command:
```
!pip install numexpr==2.8.0 --upgrade
```

Built Your Own Container

I had an error in this line: (I think it's related to sagemaker versions)

from sagemaker.predictor import csv_serializer

predictor = tree.deploy(1, "ml.m4.xlarge", serializer=csv_serializer)

I had to change it to:

from sagemaker.serializers import CSVSerializer

csv_serializer = CSVSerializer()

predictor = tree.deploy(1, "ml.m4.xlarge", serializer=csv_serializer)

I had another error in this line:

transformer.transform(
    data_location, 
    content_type="text/csv", 
    split_type="Line", 
    input_filter="$[1:]"
)
transformer.wait()

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTransformJob operation: The account-level service limit 'ml.m4.xlarge for transform job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.

I had to request an increase for ml.m5.xlarge for transform job usage quota. See this link for more details.