Abdullah Şamil Güser

SageMaker Service Overview

How is AWS SageMaker different from other ML frameworks? (Lesson 49)

Overview of SageMaker

Key Differences from Direct Framework Usage

  1. Framework Installation:
    • Traditional ML frameworks like SKLearn, TensorFlow, or PyTorch can be installed and used locally on a laptop.
    • SageMaker, being cloud-based, requires the use of AWS resources.
  2. Production Application:
    • Converting local ML models into production applications often requires additional tools or cloud services.
    • SageMaker simplifies this process, making it more straightforward to transition models to production.
  3. Cost and Trial Period:
    • SageMaker usage is billed, but new users get a free tier for the first two months to get started.
  4. Containers and Built-in Algorithms:
    • SageMaker utilizes containers to encapsulate algorithms and frameworks.
    • It provides containers for built-in algorithms (XGBoost, DeepAR, Factorization Machines, etc.) and for popular frameworks (PyTorch, SKLearn, TensorFlow).
  5. Training Parameters:
    • A training job requires specifying the data location (S3 or EFS), the hyperparameters, the server (instance) type for training, and the storage location for trained artifacts.
  6. Script File for Frameworks:
    • In addition to training parameters, frameworks on SageMaker require a script file.
    • This script includes the model-building code specific to PyTorch, TensorFlow, or SKLearn.
  7. Hosting Models:
    • Post-training, models can be hosted on SageMaker for deployment.
    • SageMaker’s containerized approach offers a standardized interface for building and deploying models across different algorithms and frameworks.
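The training parameters listed in item 5 can be sketched as a plain Python dictionary. This is a minimal illustration that only assembles the request and never calls AWS. The field names follow the shape of the low-level CreateTrainingJob request as I understand it, so check them against the current API reference before relying on them. The job name, image URI, role ARN, and bucket paths are hypothetical placeholders, not values from the lesson.

```python
import json


def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                               output_s3_uri, hyperparameters,
                               instance_type="ml.m5.xlarge", instance_count=1,
                               volume_gb=30, max_runtime_s=3600):
    """Assemble a CreateTrainingJob-style request body (no AWS call is made)."""
    return {
        "TrainingJobName": job_name,
        # Container that encapsulates the algorithm or framework.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # Where the training data lives (one channel named "train").
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
            "ContentType": "text/csv",
        }],
        # Where the trained model artifacts are stored.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # Server type and count used for training.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": volume_gb,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
        # Hyperparameter values are passed as strings.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }


# Hypothetical placeholder values, for illustration only.
example_request = build_training_job_request(
    job_name="demo-xgboost-job",
    image_uri="<training-image-uri>",
    role_arn="<execution-role-arn>",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
    hyperparameters={"max_depth": 5, "num_round": 100},
)
print(json.dumps(example_request, indent=2))
```

For a framework container (PyTorch, TensorFlow, SKLearn), the script file from item 6 would be supplied in addition to these parameters.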

Conclusion

Introduction to SageMaker (Lesson 50)

Key Capabilities

  1. Jupyter Notebook Environment:
    • Managed Jupyter notebook instances for development and data preparation.
    • Automated Python installation and patch application.
    • Custom Python packages installation supported.
  2. Machine Learning Algorithm Support:
    • Wide variety of optimized ML algorithms for AWS Cloud.
    • Optimized environments for frameworks like TensorFlow and Apache MXNet.
    • Custom algorithm deployment capability.
  3. Training Infrastructure:
    • Scalable training on one or multiple compute instances.
    • Handles large datasets, with trained model artifacts stored in S3.
  4. Model Deployment:
    • Real-time prediction support with load-balanced compute instances.
    • Auto-scaling for instance management and workload adaptation.
    • Batch transform for non-interactive, large-scale inference tasks.

Deployment Options

Summary

Instance Type and Pricing (Lesson 51)

Instance Families

  1. Standard Instances:
    • Low-cost, balanced performance and memory.
    • T2, T3 (occasional burst utilization), M5 (sustained load) instances.
  2. Compute Optimized Instances:
    • High-performance CPUs for CPU-intensive tasks.
    • C4, C5 instance types.
  3. Accelerated Computing Instances:
    • Powerful GPUs for GPU-optimized algorithms.
    • P2, P3 instances for faster training and GPU-enabled hosting.
  4. Inference Acceleration:
    • Fractional GPU capabilities as an add-on to other instances.
    • Suitable for models needing partial GPU support in inference.
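The instance families above can be captured in a small lookup table. The sketch below assumes the usual SageMaker naming pattern `ml.<family>.<size>` (for example `ml.m5.xlarge`) and covers only the families named in this lesson. The helper name `instance_family` is made up for illustration.

```python
# Families taken from the list above; anything else is reported as "unknown".
FAMILY_BY_PREFIX = {
    "t2": "standard (burstable)",
    "t3": "standard (burstable)",
    "m5": "standard (sustained load)",
    "c4": "compute optimized",
    "c5": "compute optimized",
    "p2": "accelerated computing (GPU)",
    "p3": "accelerated computing (GPU)",
}


def instance_family(instance_type):
    """Map a name such as 'ml.c5.xlarge' to the family described in the notes."""
    parts = instance_type.split(".")
    if len(parts) != 3 or parts[0] != "ml":
        raise ValueError(f"expected 'ml.<family>.<size>', got {instance_type!r}")
    return FAMILY_BY_PREFIX.get(parts[1], "unknown")


for name in ("ml.t3.medium", "ml.c5.xlarge", "ml.p3.2xlarge"):
    print(name, "->", instance_family(name))
```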

Choosing Instances

Pricing Components

Free-Tier Offer (for New Users)

Beyond Free-Tier

Summary

SageMaker provides a range of instance options catering to various ML requirements, with pricing involving multiple components. Users can choose instances based on their algorithm’s needs and manage costs effectively by right-sizing and using the free-tier benefits.

Save Money on SageMaker Usage (Lesson 52)

1. SageMaker Savings Plan

2. Managed Spot Training

3. Learning with SageMaker

By utilizing these methods, users can significantly reduce their AWS SageMaker costs while maintaining efficient machine learning operations.

Also refer to CloudPractitionerReview-InfraPricingSupport.pdf. This review material provides a quick overview of important concepts related to infrastructure, pricing, support plans, and the shared responsibility model in the cloud.

Data Format (Lesson 53)

Supported Data Formats

Storage and Retrieval

Usage of Channels

Data Transfer Modes

  1. File Mode:
    • Entire data from S3 is copied to training instance volumes.
    • Requires enough disk space on training instances for the full dataset and model artifacts.
  2. Pipe Mode:
    • Streams data from S3 to training instances.
    • Faster start times and better throughput.
    • Reduces storage needs on training instances.
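The disk-space constraint that separates the two modes can be expressed as a simple check. This is a deliberately simplified rule of thumb, not an official sizing formula: it picks File mode when the dataset plus the model artifacts fit on the training volume and falls back to Pipe mode otherwise. The function name and the size figures are hypothetical, and the sketch assumes the chosen algorithm supports Pipe mode at all.

```python
def choose_input_mode(dataset_gb, artifacts_gb, volume_gb):
    """Return 'File' if the dataset and artifacts fit on the volume, else 'Pipe'."""
    if min(dataset_gb, artifacts_gb, volume_gb) < 0:
        raise ValueError("sizes must be non-negative")
    # File mode copies the whole dataset to the instance volume first,
    # so the volume has to hold both the data and the model artifacts.
    if dataset_gb + artifacts_gb <= volume_gb:
        return "File"
    # Pipe mode streams from S3, so the dataset does not need to fit on disk.
    return "Pipe"


print(choose_input_mode(dataset_gb=20, artifacts_gb=2, volume_gb=30))
print(choose_input_mode(dataset_gb=500, artifacts_gb=2, volume_gb=30))
```

Even when the data fits on disk, Pipe mode may still be preferable for its faster start time, so treat the result as a lower bound on when streaming is needed.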

Hands-On Lab: Data Format Exploration

Summary

SageMaker Built-in Algorithms (Lesson 54)

AWS SageMaker provides a range of built-in algorithms optimized for cloud-based machine learning. These algorithms can be broadly categorized based on their applications:

Text Data Algorithms

  1. BlazingText:
    • Modes: Unsupervised (Word2Vec) and supervised.
    • Use: Converts text to vector, grouping semantically similar words. Useful for text classification.
  2. Object2Vec:
    • Type: Supervised.
    • Use: Converts text to vector while capturing sentence structure. Suitable for associating customers with products, movies with ratings, etc.

Recommender Systems and Collaborative Filtering

  1. Factorization Machines:
    • Type: Suitable for high-dimensional sparse datasets.
    • Use: Popular for building recommender systems and collaborative filtering.

Classification and Regression

  1. K-Nearest Neighbor (KNN):
    • Use: Simple, effective for classification (majority class of k-nearest neighbors) and regression (average value of k-nearest neighbors).
  2. Linear Models:
    • Types: Linear regression, logistic regression, multinomial logistic regression.
  3. XGBoost:
    • Use: Gradient boosted tree algorithm for both regression and classification.

Time Series Forecasting

  1. DeepAR:
    • Use: Trains on multiple related time series and can forecast new, similar series.

Image Analysis

  1. Object Detection:
    • Use: Detects and classifies objects in images, provides bounding boxes.
  2. Image Classification:
    • Use: Multi-label classification of images.
  3. Semantic Segmentation:
    • Use: Tags each pixel in an image with a class label, useful in computer vision.

Language Processing

  1. Sequence to Sequence:
    • Use: Text summarization, language translation, and speech to text.

Clustering and Topic Modeling

  1. K-Means:
    • Type: Unsupervised clustering algorithm.
  2. LDA (Latent Dirichlet Allocation):
    • Use: Groups documents based on topics.
  3. Neural Topic Model (NTM):
    • Use: Similar to LDA, for document grouping by topics.

Dimensionality Reduction

  1. PCA (Principal Component Analysis):
    • Use: Reduces dataset dimensions while retaining information.

Anomaly Detection

  1. Random Cut Forest:
    • Use: Detects anomalies in data, assigns scores to points.
  2. IP Insights:
    • Use: Detects unusual network activity, useful for security applications.

Summary

SageMaker Ground Truth

SageMaker Neo

Developing Models with Other Frameworks

Summary

SageMaker provides a flexible environment for machine learning, offering: