Abdullah Şamil Güser

XGBoost

Introduction to XGBoost (Lesson 68)

XGBoost in Machine Learning

Linear Models

Tree-Based Models

Ensemble Methods

XGBoost vs. Linear Models

Conclusion

Labs - Local Mode

1. Data Preparation and Training Simple Regression (Lesson 69&70)

! SageMaker resets the notebook each time we stop and start the notebook instance.

2. Data Preparation and Training Non-linear Data set (Lesson 71&72)

3. Data Preparation and Training for Bike Rental Regression (Lesson 74&75)

4. Train using Log of Count (Lesson 76)

! The last three labs are the same as Lab 1 as far as SageMaker knowledge is concerned.

How to increase quota limit (Lesson 77)

If you encounter a ResourceLimitExceeded error, follow these steps to request a quota increase:

Steps to Request Quota Increase

  1. Navigate to the Service Quotas Console: Open the AWS Service Quotas console and select Amazon SageMaker.
  2. Search for Quota Names: Look for the following quotas:
    • ml.m5.xlarge for notebook instance usage
    • ml.m5.xlarge for training job usage
    • ml.m5.xlarge for spot training job usage
    • ml.m5.xlarge for endpoint usage
  3. Request Quota Increase:
    • Click on the quota name to view its detail page.
    • Scroll to the “Request Quota Increase” section.
    • If the current quota value is 0, request an increase to 1.
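
The same lookup and request can also be scripted. The sketch below is a minimal illustration, assuming a boto3 `service-quotas` client is passed in and that quota names follow the "ml.m5.xlarge for ... usage" pattern listed above. The helper names `find_zero_quotas` and `request_increase_to_one` are hypothetical, and a submitted request may still need AWS approval.

```python
def find_zero_quotas(client, instance_type="ml.m5.xlarge"):
    """Return SageMaker quotas for instance_type whose current value is 0.

    `client` is assumed to be boto3.client("service-quotas").
    """
    matches = []
    kwargs = {"ServiceCode": "sagemaker"}
    while True:
        page = client.list_service_quotas(**kwargs)
        for quota in page.get("Quotas", []):
            name_matches = quota["QuotaName"].startswith(instance_type + " for")
            if name_matches and quota["Value"] == 0:
                matches.append(quota)
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # fetch the next page of quotas
    return matches


def request_increase_to_one(client, quota):
    """Request that a quota currently at 0 be raised to 1."""
    return client.request_service_quota_increase(
        ServiceCode="sagemaker",
        QuotaCode=quota["QuotaCode"],
        DesiredValue=1.0,
    )


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("service-quotas")
# for quota in find_zero_quotas(client):
#     request_increase_to_one(client, quota)
```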

Additional Resources

Labs - Using SageMaker Provided Algorithms

Using SageMaker in the cloud involves four steps:

  1. Store your training and validation files in S3.
  2. Specify the training algorithm and hyperparameters.
  3. Configure the type and number of servers to use for training.
  4. Create a real-time endpoint for the trained model.
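
As a rough illustration of how steps 1 to 3 map onto a single request, the sketch below builds the dictionary that the low-level boto3 `create_training_job` call accepts. The helper name `build_training_request`, the default hyperparameter values, and every URI or ARN passed in are placeholders chosen for illustration, not values from the labs. The XGBoost container image URI is region-specific and is taken as a parameter here.

```python
def build_training_request(job_name, image_uri, role_arn, train_s3_uri,
                           validation_s3_uri, output_s3_uri,
                           instance_type="ml.m5.xlarge", instance_count=1):
    """Build keyword arguments for the SageMaker create_training_job API."""

    def channel(name, s3_uri):
        # Step 1: training and validation files live in S3.
        return {
            "ChannelName": name,
            "ContentType": "text/csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_uri,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }

    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "InputDataConfig": [
            channel("train", train_s3_uri),
            channel("validation", validation_s3_uri),
        ],
        # Step 2: algorithm (container image) and hyperparameters.
        # Hyperparameter values are passed as strings; these are examples only.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "HyperParameters": {"objective": "reg:squarederror", "num_round": "150"},
        # Step 3: type and number of training servers.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 20,
        },
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# request = build_training_request(...)
# boto3.client("sagemaker").create_training_job(**request)
# Step 4 (the real-time endpoint) is a separate set of calls made after
# training: create_model, create_endpoint_config, and create_endpoint.
```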

How to train using SageMaker’s built-in XGBoost Algorithm (Lesson 78)

Q&A: How does SageMaker built-in know the target variable? (Lesson 79)

How to run predictions against an existing SageMaker Endpoint (Lesson 80)

TODOs in this lab:

Let’s go:

SageMaker Endpoint Features (Lesson 82)

Integration with CloudWatch and Auto Scaling

High Availability and Load Balancing

Metrics for Monitoring and Scaling

Handling Different Versions of Models

SageMaker Spot Instances - Save up to 90% for training jobs (Lesson 83)

Spot instances in AWS SageMaker can significantly reduce the cost of training machine learning models, offering discounts of up to 90% compared to on-demand instances.

Advantages of Using Spot Instances

  1. Cost-Effective: Large discounts, often over 80%; the exact saving varies by instance type and size.
  2. Flexibility: Better chances of obtaining an instance when flexible with type and size.
  3. Resumption of Training: SageMaker handles spot-interruptions and resumes training when capacity is available.
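
To make the spot-related settings concrete, here is a minimal sketch of the extra fields a `create_training_job` request takes for managed spot training, together with the savings arithmetic (the share of training seconds that was not billed). The helper names and the checkpoint URI are hypothetical. The maximum wait time must be at least the maximum run time, since it covers both waiting for capacity and training.

```python
def spot_training_settings(max_run_seconds, max_wait_seconds, checkpoint_s3_uri):
    """Return the spot-specific fields for a create_training_job request."""
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait_seconds must be >= max_run_seconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_run_seconds,
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
        # Checkpoints in S3 let an interrupted job pick up where it stopped.
        "CheckpointConfig": {"S3Uri": checkpoint_s3_uri},
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved: the share of training seconds that was not billed."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


settings = spot_training_settings(3600, 7200, "s3://example-bucket/checkpoints")
savings = spot_savings_percent(training_seconds=100, billable_seconds=20)
print(f"Managed spot training savings: {savings:.1f}%")
```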

Considerations and Best Practices

Understanding Job Output Metrics

Useful Resources

More Labs

Lab - Multi-class Classification (Lesson 84)

Lab - Binary Classification (Lesson 85)

Data Leakage (Lesson 89)

Common Sources of Data Leakage

  1. Using Whole Data Statistics for Feature Engineering: Don’t apply normalization, standardization, or imputation using statistics from the entire dataset, as it includes test data information.

  2. Duplicate Data Points: Don’t allow the same data points to appear in both the training and test sets. Remove duplicates before evaluating, otherwise the model is tested on rows it has already seen.

  3. Handling Time Series Data: Don’t split time series data randomly. Because the data is sequential, a random split can leak future information into the training set.
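
The three points above can be sketched with the standard library alone. This is a minimal illustration under assumed conditions: rows are dictionaries with hypothetical `timestamp` and `value` fields, and the helper names are invented for the example. The split is chronological, duplicates of training rows are dropped from the test set, and the standardization statistics come from the training rows only.

```python
from statistics import mean, pstdev


def time_ordered_split(rows, train_fraction=0.8):
    """Split chronologically so every test row is later than the training rows."""
    ordered = sorted(rows, key=lambda row: row["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def drop_test_duplicates(train, test, key):
    """Remove test rows whose key already appears in the training set."""
    seen = {key(row) for row in train}
    return [row for row in test if key(row) not in seen]


def fit_standardizer(train_values):
    """Compute mean and standard deviation from training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid dividing by zero
    return lambda x: (x - mu) / sigma


rows = [{"timestamp": t, "value": float(t)} for t in range(10)]
train, test = time_ordered_split(rows)

# Fit on training data, then apply the same transform to both sets.
scale = fit_standardizer([row["value"] for row in train])
train_scaled = [scale(row["value"]) for row in train]
test_scaled = [scale(row["value"]) for row in test]

test = drop_test_duplicates(train, test, key=lambda row: row["value"])
```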

HyperParameter Tuning, Bias-Variance, Regularization (Lesson 90)

Key Hyperparameters in XGBoost

  1. Objective: Sets the learning objective. Common options include:
    • reg:squarederror for regression with squared loss (reg:linear is the older, deprecated name).
    • binary:logistic for binary classification.
    • multi:softmax for multiclass classification.
  2. Number of Rounds: Determines the number of trees for learning. More trees can lead to overfitting.

  3. Early Stopping Rounds: Helps prevent overfitting by stopping the training when the validation loss doesn’t decrease for a specified number of rounds.
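
The early stopping rule can be illustrated without any library. The sketch below is not XGBoost's implementation; it only mirrors the idea on a list of made-up validation losses, one per boosting round: training stops once the loss has gone `early_stopping_rounds` rounds without improving, and the best round seen so far is kept.

```python
def best_round_with_early_stopping(val_losses, early_stopping_rounds):
    """Return (best_round, last_round_run) for per-round validation losses.

    Rounds are counted from 0.
    """
    best_loss = float("inf")
    best_round = -1
    for round_index, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_index
        elif round_index - best_round >= early_stopping_rounds:
            # No improvement for `early_stopping_rounds` rounds: stop here.
            return best_round, round_index
    return best_round, len(val_losses) - 1


# Made-up losses: improvement stalls after round 2, then recovers at round 6.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
print(best_round_with_early_stopping(losses, early_stopping_rounds=3))
print(best_round_with_early_stopping(losses, early_stopping_rounds=10))
```

With a patience of 3 the loop stops before it can see the later improvement at round 6, which shows the trade-off: a small value saves training time but can stop too early on a noisy validation curve.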

Bias and Variance in Machine Learning

Understanding Regularization

Hyperparameter Tuning