SageMaker Service Overview
How is AWS SageMaker different from other ML frameworks? (Lesson 49)
Overview of SageMaker
- SageMaker is a cloud service that leverages AWS resources for training and hosting machine learning models.
Key Differences from Direct Framework Usage
- Framework Installation:
- Traditional ML frameworks like SKLearn, TensorFlow, or PyTorch can be installed and used locally on a laptop.
- SageMaker, being cloud-based, requires the use of AWS resources.
- Production Application:
- Converting local ML models into production applications often requires additional tools or cloud services.
- SageMaker simplifies this process, making it more straightforward to transition models to production.
- Cost and Trial Period:
- Using SageMaker involves some cost, but it offers a two-month free trial for getting started.
- Containers and Built-in Algorithms:
- SageMaker utilizes containers to encapsulate algorithms and frameworks.
- Provides containers for built-in algorithms (XGBoost, DeepAR, FM, etc.) and popular frameworks (PyTorch, SKLearn, TensorFlow).
- Training Parameters:
- Requires specifying training parameters like data location (S3 or EFS), hyperparameters, server type for training, and storage location for trained artifacts.
- Script File for Frameworks:
- In addition to training parameters, frameworks on SageMaker require a script file.
- This script includes the model-building code specific to PyTorch, TensorFlow, or SKLearn.
- Hosting Models:
- Post-training, models can be hosted on SageMaker for deployment.
- SageMaker’s containerized approach offers a standardized interface for building and deploying models across different algorithms and frameworks.
Conclusion
- SageMaker stands out in its ability to streamline the transition from model training to production deployment, backed by AWS’s robust cloud infrastructure.
Introduction to SageMaker (Lesson 50)
Key Capabilities
- Jupyter Notebook Environment:
- Managed Jupyter notebook instances for development and data preparation.
- Automated Python installation and patch application.
- Custom Python packages installation supported.
- Machine Learning Algorithm Support:
- Wide variety of optimized ML algorithms for AWS Cloud.
- Optimized environments for frameworks like TensorFlow, Apache MxNet.
- Custom algorithm deployment capability.
- Training Infrastructure:
- Scalable training on one or multiple compute instances.
- Handles large datasets, with trained model artifacts stored in S3.
- Model Deployment:
- Real-time prediction support with load-balanced compute instances.
- Auto-scaling for instance management and workload adaptation.
- Batch transform for non-interactive, large-scale inference tasks.
Deployment Options
- Real-time Prediction:
- Persistent endpoint for client application model invocations.
- Auto-replacement of instances in case of hardware/system issues.
- Customizable scaling policies based on workload.
- Batch Transform:
- Ideal for non-interactive, large dataset inference.
- Manages resource allocation, dataset processing, and results storage.
- Terminates instances post-processing, enhancing cost-effectiveness.
Summary
- SageMaker offers a comprehensive suite of services for building, training, and deploying ML models in the cloud.
- Features include a fully managed development environment, extensive algorithm and framework support, scalable training infrastructure, and flexible deployment options for real-time and batch processing.
Instance Type and Pricing (Lesson 51)
Instance Families
- Standard Instances:
- Low-cost, balanced performance and memory.
- T2, T3 (occasional burst utilization), M5 (sustained load) instances.
- Compute Optimized Instances:
- High-performance CPUs for CPU-intensive tasks.
- C4, C5 instance types.
- Accelerated Computing Instances:
- Powerful GPUs for GPU-optimized algorithms.
- P2, P3 instances for faster training and GPU-enabled hosting.
- Inference Acceleration:
- Fractional GPU capabilities as an add-on to other instances.
- Suitable for models needing partial GPU support in inference.
Choosing Instances
- Based on algorithm requirements: CPU-bound or GPU-bound.
- Experiment with instance sizes for optimal performance and cost.
- Instance naming:
ML.<instance type>.<generation>.<size> (e.g., ML.C5.2xlarge).
Pricing Components
- Instance Cost: Varies by type and size.
- Variable Costs: Depends on fractional GPUs, storage allocated, and data transfer.
- Region-Specific Pricing: Slight variations between AWS regions.
Free-Tier Offer (for New Users)
- Development: T2/T3 medium notebook instances - 250 hours/month.
- Training: M4/M5 extra-large instances - 50 hours/month.
- Hosting: M4/M5 extra-large instances - 125 hours/month.
Beyond Free-Tier
- Development (Example):
- T3 extra-large instance: $0.23/hour.
- Storage: $0.14/GB/month.
- Data Transfer: $0.016/GB (in/out).
- Training Costs: Based on instance runtime and storage during training.
- Hosting Options:
- Real-time Hosting: Cost depends on instance type, GPUs, storage, data transfer.
- Batch Transform: Charges for job duration, suitable for non-interactive tasks.
Summary
SageMaker provides a range of instance options catering to various ML requirements, with pricing involving multiple components. Users can choose instances based on their algorithm’s needs and manage costs effectively by right-sizing and using the free-tier benefits.
Save Money on SageMaker Usage (Lesson 52)
1. SageMaker Savings Plan
- Overview: Offers up to 64% discount over on-demand pricing for consistent usage.
- Commitment: Choose a 1-year or 3-year term with a fixed $/hour usage.
- Applicability: Applies to SageMaker Notebook, Studio, Training, Real-time Inference, Batch Transform, Data Wrangler, etc.
- Global Benefit: Discounts apply across regions, instance families, and sizes.
- Challenges:
- Determining steady-state usage: Use AWS Cost Explorer to analyze and recommend plans.
- Fluctuating Usage: Billing based on hourly usage, not monthly aggregate. Unused hours may not carry over.
- Example: Convert a $10/hour steady-state usage to a $4/hour savings plan to reduce monthly charges from $7,200 to $2,880.
- Resource: SageMaker Savings Plan Announcement
2. Managed Spot Training
- Discounts: Up to 90% off on AWS’s unused capacity.
- Use Case: Ideal for training and automatic model tuning jobs (not for inference).
- Handling Interruptions: SageMaker automatically manages spot instance interruptions and resumes training when capacity is available.
- No Long-term Commitment: Flexible and cost-effective for training and tuning.
- Resource: Managed Spot Training Documentation
3. Learning with SageMaker
- Two-Month Trial: Ideal for learners, providing free-tier instance types.
- Beyond Free-Tier: Managed spot training is recommended for cost savings.
- Multiple Accounts for Extended Trial: Sign up with different email IDs for additional two-month trial periods (albeit cumbersome).
By utilizing these methods, users can significantly reduce their AWS SageMaker costs while maintaining efficient machine learning operations.
Also refer to the CloudPractitionerReview-InfraPricingSupport.pdf. This review material provides a quick overview of important concepts related to infrastructure, pricing, support plans, and shared responsibility model in the cloud
- Common Formats: CSV and RecordIO (protobuf)
- Other Formats: Some SageMaker algorithms also support LibSVM, JSON, and Parquet formats.
Storage and Retrieval
- Storage Location: Data for training and validation must be stored in an Amazon S3 bucket.
- Data Access:
- Single File: If a single file in S3 is specified, its contents are used for training.
- Folder: If a folder is specified, all data within that folder is used for training.
Usage of Channels
- Train Channel: For the training dataset.
- Test Channel: For the evaluation dataset.
Data Transfer Modes
- File Mode:
- Entire data from S3 is copied to training instance volumes.
- Requires enough disk space on training instances for the full dataset and model artifacts.
- Pipe Mode:
- Streams data from S3 to training instances.
- Faster start times and better throughput.
- Reduces storage needs on training instances.
- Objective: Create a dataset in CSV and RecordIO formats, upload to S3, and download from S3.
- Tools Used: Pandas, NumPy, Boto3, SageMaker library.
- Process:
- Create a dataset with 3 input features (X1, X2, X3) and a binary target variable (Y).
- Convert the dataset to CSV using Pandas
to_csv method.
- Upload the CSV file to S3 using the
write_to_s3 helper function.
- Use
write_numpy_to_dense_tensor method from SageMaker library to create a RecordIO formatted file.
- Upload and download RecordIO file to/from S3.
Summary
- SageMaker supports multiple data formats for efficient training and prediction.
- The choice of data format (CSV, RecordIO, others) depends on algorithm compatibility and dataset size.
- Data must be stored in S3, with training and validation data distinguished via channels.
- File and Pipe modes offer different approaches to data transfer from S3 to training instances.
- Practically, users can create, upload, and manage these data formats using SageMaker’s integrated tools and AWS libraries.
SageMaker Built-in Algorithms (Lesson 54)
AWS SageMaker provides a range of built-in algorithms optimized for cloud-based machine learning. These algorithms can be broadly categorized based on their applications:
Text Data Algorithms
- BlazingText:
- Modes: Unsupervised (Word2Vec) and supervised.
- Use: Converts text to vector, grouping semantically similar words. Useful for text classification.
- Object2Vec:
- Type: Supervised.
- Use: Converts text to vector while capturing sentence structure. Suitable for associating customers with products, movies with ratings, etc.
Recommender Systems and Collaborative Filtering
- Factorization Machines:
- Type: Suitable for high-dimensional sparse datasets.
- Use: Popular for building recommender systems and collaborative filtering.
Classification and Regression
- K-Nearest Neighbor (KNN):
- Use: Simple, effective for classification (majority class of k-nearest neighbors) and regression (average value of k-nearest neighbors).
- Linear Models:
- Types: Linear regression, logistic regression, multinomial logistic regression.
- XGBoost:
- Use: Gradient boosted tree algorithm for both regression and classification.
Time Series Forecasting
- DeepAR:
- Use: Trains on multiple time series, predicts new similar series.
Image Analysis
- Object Detection:
- Use: Detects and classifies objects in images, provides bounding boxes.
- ImageClassification:
- Use: Multi-label classification of images.
- Semantic Segmentation:
- Use: Tags each pixel in an image with a class label, useful in computer vision.
Language Processing
- Sequence to Sequence:
- Use: Useful for text summarization, language translation, speech to text.
Clustering and Topic Modeling
- K Means:
- Type: Unsupervised clustering algorithm.
- LDA (Latent Dirichlet Allocation):
- Use: Groups documents based on topics.
- Neural Topic Modeling:
- Use: Similar to LDA, for document grouping by topics.
Dimensionality Reduction
- PCA (Principal Component Analysis):
- Use: Reduces dataset dimensions while retaining information.
Anomaly Detection
- Random Cut Forest:
- Use: Detects anomalies in data, assigns scores to points.
- IP Insights:
- Use: Detects unusual network activity, useful for security applications.
Summary
- SageMaker’s built-in algorithms cover a wide range of ML applications.
- They are optimized for AWS cloud, offering scalability and GPU training advantages.
- Users can choose the appropriate algorithm based on their data and use case requirements.
Popular Frameworks and Bring Your Own Algorithm (Lesson 55)
SageMaker Ground Truth
- Purpose: Assists in labeling data for machine learning.
- Capabilities:
- Automatic Labeling: Uses provided examples to label remaining data.
- Human Labeling: Distributes labeling tasks to human labelers via integration with Mechanical Turk, managing the workflow.
SageMaker Neo
- Functionality: Optimizes machine learning models for deployment across cloud and edge locations.
- Key Features:
- Cross-Compilation: Converts and optimizes models for various hardware like Intel, Nvidia, ARM.
- Use Case: Suitable for applications where low latency is crucial, allowing deployment at edge locations.
Developing Models with Other Frameworks
- Container-Based Architecture: Utilizes Docker containers for training and deploying models.
- Pre-Built Docker Images: Available for built-in algorithms and popular frameworks like TensorFlow, Apache MxNet, etc.
- Integration with Apache Spark:
- Provides a library for Python and Scala.
- Directly consumes data frames in Spark clusters, converting them to protobuf and uploading to S3.
- Framework Support: Offers support for TensorFlow, Apache MxNet, Scikit-learn, PyTorch, Chainer, etc.
- Custom Algorithm Development:
- Option to extend SageMaker Docker images or use custom container images.
- Suitable for unique runtime or language requirements.
- Advanced Use with Deep Learning AMIs:
- For modifying or extending frameworks.
- Ideal for users contributing to code or debugging at the framework level.
Summary
SageMaker provides a flexible environment for machine learning, offering:
- Built-in algorithms for ease of use and scalability.
- Support for popular frameworks via pre-built container images.
- Options to extend or bring custom container images for unique requirements.
- Ground Truth for data labeling and Neo for optimizing model deployment.