Machine Learning Concepts
Introduction to Machine Learning, Concepts, Terminologies (Lesson 23)
1. Types of Learning Algorithms
- Supervised Learning:
- Input data and correct answers are provided.
- Example: Classifying Iris plant types based on measurements.
- Unsupervised Learning:
- No specific target for prediction.
- Used for grouping, anomaly detection, and dimensionality reduction.
- Reinforcement Learning:
- Decision-making under uncertainty, using rewards or penalties.
- Supported by SageMaker.
2. Supervised Learning
- Data Splitting:
- Labeled data is split into a training set (70%) and a test set (30%).
- Shuffling data before splitting is crucial.
- Algorithm Types:
- Regression: Predicts a numeric output.
- Binary Classification: Predicts one of two possible outcomes.
- Multi-class Classification: Predicts one of several possible outcomes.
3. Unsupervised Learning
- Applications:
- Grouping similar observations.
- Anomaly detection.
- Feature reduction (e.g., PCA).
- Finding similar words (e.g., BlazingText, FastText).
4. Reinforcement Learning
- Used for scenarios like autonomous driving and strategy games.
- Involves learning from a mix of examples and uncertainties.
Data Types - How to handle mixed data types (Lesson 24)
Categorical Values
- Encoding: Convert text-based categories to numeric form for tree-based algorithms like XGBoost.
- One-hot Encoding: Used for algorithms like Linear Regression, converting categories into separate columns.
- Feature Combination: Combine features to form new ones for capturing non-linear relationships.
Text Data
- Bag-of-Words: Tokenize text into words; each word becomes a feature.
- NGram Transformation: Capture contiguous sequences of words to better represent meaning.
- OSB Transformation: Similar to NGrams, but includes non-contiguous word combinations.
- Stemming: Reduce words to their root form for consistency.
- Case Uniformity: Convert all text to either lowercase or uppercase.
- Punctuation Removal: Improve signal by removing punctuations.
Numeric Data
- As-is Usage: Suitable for linear relationships.
- Normalization: Adjust features to similar scales to prevent dominance of larger scale features.
- Binning: Segment numeric values into bins for capturing non-linearity.