Endpoint Changes with Zero Downtime
[Repeat] Endpoint Features, Monitoring and AutoScaling (Lesson 105)
- Model deployed to an endpoint with a single instance poses a risk of failure, impacting client applications.
Multi-Instance Configuration for Reliability
- Load Balancing: Requests are distributed across available instances for better reliability.
- High Availability: Instances spread across multiple Availability Zones to withstand zone-level failures.
Auto Scaling Based on Workload
- Scaling Strategy:
- Scale-Out: Add more instances during increased workload.
- Scale-In: Reduce instances when workload decreases.
- Metrics for Decision Making:
- Number of requests per second, CPU utilization, etc.
- Use CloudWatch metrics to inform scaling decisions.
Recommended Safety Factor
- Safety Factor of 0.5: Keep instance load at or below 50%.
Hosting Multiple Model Variants
- Distribute traffic between models (e.g., 90% to existing model, 10% to new version).
- Adjust traffic distribution based on new model performance.
How to handle changes to production system? (Lesson 106)
- Single Model Endpoint
- Structure: Single serving container with a model on an instance.
- Fault Tolerance: Replication for reliability, load balancing, and auto-scaling based on traffic.
-
Limitation: Single point of failure; one container per instance.
-
Multiple Model/Production Variants
- Setup: New endpoint configuration with old and new models.
- Traffic Distribution: Configure traffic weights (e.g., 70% old model, 30% new model).
- Advantages: Zero downtime deployments, specific model variant targeting, separate auto-scaling rules.
-
Drawback: One serving container and model artifact per instance, leading to multiple server instances.
-
Multimodal Endpoint
- Usage: Host multiple models with a shared serving container (same algorithm).
-
Cost Efficiency: Reduced infrastructure costs, better utilization of endpoint instances.
-
Multi-Container Endpoint
- Functionality: Deploy multiple serving containers on a single endpoint.
- Use Cases: Inference pipelines or direct invocation of different algorithm models.
- Advantages: Host different algorithms on one instance, reduce infrastructure costs.
- Limitations: Maximum of five containers co-hosted.
Lab - A/B Testing Multiple Production Variants (Lesson 107)
- Prepare & Train Models
-
Use multiple_versions_training.ipynb to create training jobs.
-
Create SageMaker Models
-
Use training jobs to create models.
-
Create Endpoint Configuration
- Then go to
Inference -> Endpoint Configurationsand create a new endpoint configuration. - Choose
Create Production Variantand add the first version of the model.- Choose
Edit. - You can update Variant Name as
version-0-90-2 - You can select Instance Type as
ml.m5.large - Initial Instance Count and Initial Weight can both stay
1.
- Choose
- Repeat the same steps for the second version of the model, i.e.
version-1-2-2 -
We've assigned equal initial weight (1) to each model for a 50/50 traffic distribution.
-
Create Endpoint
- Go to
Inference -> Endpointsand create a new endpoint. - You can update Endpoint Name as
xgboost-bikerental -
Choose the endpoint configuration you created in the previous step.
-
Endpoint Management - Adjust Weights
-
You can adjust weights if necessary from the endpoint configuration.
-
Endpoint Management - Auto-Scaling
- You can also configure auto-scaling based on the variant invocation per instance metric.
- Select one of the variants and choose
Configure Auto Scaling. - You can set the minimum and maximum instance count and the target value for the metric.
- We need a trigger for auto scaling to respond to traffic.
- The default metric that is used for trigger is the
VariantInvocationsPerInstancemetric. -
This metric tracks the number of requests per minute, per instance.
-
Invoke Endpoint for A/B Testing
- Use the multiple_versions_prediction.ipynb notebook.
- Set endpoint name, predictor instance, and optionally specify target variant.
-
Compare results for different versions.
-
Analysis & Cleanup
- Don't forget to delete endpoint.
Lab – Multi-model Endpoint (Lesson 108)
-
Deploy two XGBoost models with different hyperparameters on a multimodal endpoint for flexibility and cost reduction.
-
Prepare & Train Models
-
Use multiple_models_training.ipynb to train the models, which will create training jobs.
-
Create SageMaker Model
- In SageMaker console, select the first training job (with
hyper1). - Choose "Use multiple models" option.
- Set model artifact location to the
modelfolder in S3. (s3://asamilg-sagemaker-mls-course/bikerental-hyper/model) -
Create the model.
-
Create Endpoint Configuration
- Name:
xboost-bikerental-hyper. - Select the created model.
-
Choose
ml.m5.largeinstance type. -
Deploy Endpoint
- Name:
xgboost-bikerental-hyper. -
Use the created endpoint configuration.
-
Invoke Multimodal Endpoint
- Open multiple_models_prediction.ipynb notebook.
-
Update
target_modelslist with model folder names in S3. -
Cleanup
- Delete the endpoint to avoid ongoing charges.
Summary
- Multimodal endpoints allow hosting multiple models with a shared serving container, reducing deployment costs and offering flexibility.
- Models must be trained using the same algorithm for compatibility with the shared container.
- Ideal for rapid deployment of new model versions without additional instance setup.
- Note: Increased latency on initial calls due to on-demand model artifact loading from S3.
Run Models at the Edge (Lesson 109)
SageMaker Neo
- Purpose: For latency-sensitive and streaming data analysis use cases, SageMaker Neo enables hosting ML models close to the data source.
- Functionality: Neo compiles a trained model into an executable for edge devices.
- Hardware Compatibility: Supports Intel, NVIDIA, ARM, Cadence, Qualcomm, and Xilinx architectures.
- Supported Models: Compatible with MXNet, TensorFlow, PyTorch, and XGBoost models trained via Amazon SageMaker.
Benefits of SageMaker Neo
- Performance Improvement: Up to 2X increase in performance without sacrificing accuracy.
- Framework Size Reduction: Can achieve up to a 10X reduction. Neo compiles both the model and framework into a single executable for edge deployment.
- Cross-Platform Compatibility: Run the same ML model across multiple hardware platforms.
AWS IoT Greengrass
- Role: Extends AWS capabilities to edge devices, enabling local execution of functions, containers, and ML model predictions.
- Key Features:
- Runs Lambda functions and Docker containers locally.
- Executes predictions using ML models offline.
- Integration with Neo:
- Models compiled with SageMaker Neo are placed in S3.
- Install GreenGrass and DLR (Deep Learning Runtime) on the edge device.
- Configure Greengrass to point to the model in S3.
- Greengrass downloads and hosts the model locally for inference.