Deep Learning and Neural Networks
Regression - Gradient Descent (Batch, Mini-Batch, Stochastic), Loss, RMSProp, Adam (Lesson 209)
Objective
- Explore how the linear regression algorithm trains a model and understand the role of gradient descent and loss functions.
Concepts
- Linear Regression:
- Basic algorithm for predictive modeling.
- Involves finding the best weights for input features to predict an output.
- Loss Function (Mean Squared Error):
- Measures the difference between predicted values and actual values.
- Lower loss indicates better predictions.
- Gradient Descent:
- A systematic approach to find optimal weights.
- Uses the gradient of the loss function to adjust weights.
- Moves weights in the direction that reduces the loss.
- Learning Rate:
- Controls the magnitude of weight adjustments.
- Low learning rate: Many small steps to reach the optimal weight.
- High learning rate: May overshoot the optimal weight.
- Variants of Gradient Descent:
- Batch Gradient Descent: Adjusts weights using all training examples in each iteration.
- Stochastic Gradient Descent (SGD): Adjusts weights based on each training example.
- Mini-Batch Gradient Descent: Combines batch and SGD, adjusting weights using small subsets (mini-batches) of the training data (see the sketch after this list).
- Optimization Techniques:
- Adaptive learning rate algorithms (e.g., RMSProp, Adagrad, Adam) to improve convergence.
- Momentum to reduce oscillations and speed up convergence.
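A minimal NumPy sketch of how the three variants differ in when they update the weights; the data, learning rate, and batch size below are illustrative assumptions, not values from the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # 100 samples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

w = np.zeros(3)                                   # starting weights
lr = 0.1                                          # learning rate: size of each step

def mse_gradient(X_part, y_part, w):
    """Gradient of the MSE loss with respect to the weights."""
    error = X_part @ w - y_part
    return 2 * X_part.T @ error / len(y_part)

# Batch gradient descent: one update per pass, using all training examples.
w -= lr * mse_gradient(X, y, w)

# Stochastic gradient descent: one update per training example.
for i in range(len(y)):
    w -= lr * mse_gradient(X[i:i+1], y[i:i+1], w)

# Mini-batch gradient descent: one update per small subset (here, 16 examples).
for start in range(0, len(y), 16):
    w -= lr * mse_gradient(X[start:start+16], y[start:start+16], w)
```

In Keras, the adaptive optimizers mentioned above are typically selected by name (e.g., `optimizer="adam"` or `optimizer="rmsprop"`); they replace the fixed learning-rate step with per-weight adaptive steps.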
Process
- Training Linear Regression:
- Initialize random weights.
- Calculate loss using loss function.
- Adjust weights based on the gradient of the loss function (a training-loop sketch follows this list).
- Gradient Descent in Action:
- Plot loss versus weight.
- Start with a random weight.
- Move weight in the direction that reduces the loss (opposite of the gradient).
- Continue until reaching an optimal weight.
- Dealing with Multiple Features:
- The loss surface becomes multidimensional (one dimension per weight).
- Adjust weights of all features simultaneously.
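Putting the process together, a minimal sketch of the training loop: initialize random weights, compute the MSE loss, and repeatedly step opposite the gradient. The synthetic data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

w = rng.normal(size=2)                    # 1) initialize random weights
lr = 0.1

for epoch in range(50):
    pred = X @ w
    loss = np.mean((pred - y) ** 2)       # 2) calculate loss (MSE)
    grad = 2 * X.T @ (pred - y) / len(y)  # gradient of the loss w.r.t. the weights
    w -= lr * grad                        # 3) move opposite the gradient
    if epoch % 10 == 0:
        print(f"epoch {epoch:2d}  loss {loss:.4f}")

print("learned weights:", w)              # should approach [3, -2]
```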
Classification - Gradient Descent, Loss Function (Lesson 210)
Objective
- Explore logistic regression as a classification algorithm and understand its functioning in predicting probabilities.
Concepts
- Logistic Regression Overview:
- A classification algorithm, despite the name suggesting regression.
- Similar to linear regression but predicts probabilities (0 to 1) using a sigmoid function.
- Sigmoid Function:
- Key component in logistic regression.
- Converts any input to a value between 0 and 1, ideal for probability predictions.
- Model Training:
- Input features (X) with corresponding weights (W).
- Model predicts the probability of belonging to a positive class.
- Cutoff generally at 0.5 for classifying into positive or negative classes.
- Logistic Loss Function:
- Measures the quality of predictions.
- Composed of two parts: one for positive and one for negative samples.
- Loss is high for misclassifications and low for accurate predictions.
- Gradient Descent Optimization:
- Used to find the optimal weights that minimize the logistic loss.
- Process involves adjusting weights based on the loss gradient.
- Plotting the loss against a weight gives a convex curve from which the gradient and the optimal weight are determined.
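A short NumPy sketch of the two pieces above, the sigmoid and the logistic loss; the values are illustrative:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y_true, p):
    """Binary cross-entropy: one term for positive samples, one for negative."""
    eps = 1e-12                                   # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

z = np.array([-2.0, 0.0, 3.0])                    # linear model outputs (X @ w)
p = sigmoid(z)                                    # probabilities, roughly [0.12, 0.5, 0.95]
y = np.array([0, 1, 1])                           # actual labels
print(logistic_loss(y, p))                        # low loss: predictions mostly match labels
```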
Process
- Applying Sigmoid Function:
- Use linear model output as input to sigmoid function.
- Predicts the probability of a sample belonging to the positive class.
- Setting Cutoff for Classification:
- Default cutoff is 0.5.
- Adjusting weights changes the criteria for classification.
- Computing Logistic Loss:
- Calculate logistic loss for a range of predicted probabilities.
- Compare against actual labels to evaluate loss.
- Weight Optimization with Gradient Descent:
- Start with random weights.
- Adjust weights iteratively to minimize logistic loss.
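A minimal sketch of this optimization on synthetic data; the labels, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)     # synthetic binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = rng.normal(size=2)                            # start with random weights
lr = 0.5

for _ in range(200):
    p = sigmoid(X @ w)                            # predicted probabilities
    grad = X.T @ (p - y) / len(y)                 # gradient of the logistic loss
    w -= lr * grad                                # step toward lower loss

pred_class = (sigmoid(X @ w) >= 0.5).astype(float)  # default 0.5 cutoff
print("training accuracy:", (pred_class == y).mean())
```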
Neural Networks and Deep Learning (Lesson 211)
Objective
- Understand the structure and functioning of neural networks in deep learning.
Neural Network Structure
- Basic Architecture:
- Comprises an input layer, hidden layers, and an output layer.
- Resembles logistic regression but extends it with hidden layers containing multiple neurons.
- Neurons and Activation Functions:
- Neurons generate new features by blending existing features with different weights.
- Activation functions introduce non-linearity, improving handling of complex datasets.
- Common Activation Functions:
- Sigmoid: Converts input to a range between 0 and 1.
- Tanh (Hyperbolic Tangent): Output ranges from -1 to 1.
- ReLU (Rectified Linear Unit): Outputs 0 for negative input, and raw input for positive values.
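The three activation functions above, sketched in NumPy; the input values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output in (0, 1)

def tanh(z):
    return np.tanh(z)                 # output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative input, the raw input otherwise

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(z))   # approx. [0.12 0.38 0.5  0.82]
print(tanh(z))      # approx. [-0.96 -0.46 0.   0.91]
print(relu(z))      # [0.  0.  0.  1.5]
```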
Network Types and Applications
- General-Purpose Networks:
- Fully connected; each neuron in a layer connected to all neurons in the next layer.
- Useful for diverse applications but may lead to overfitting (a minimal sketch follows this list).
- Convolutional Neural Networks (CNNs):
- Specialized for image and video analysis.
- Focuses on patterns around each pixel, not just the pixel itself.
- Recurrent Neural Networks (RNNs):
- Ideal for time series forecasting and natural language processing.
- Capable of remembering historical data, crucial for sequence-dependent predictions.
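For the general-purpose (fully connected) case noted above, a minimal Keras sketch; the feature count, layer sizes, and binary target are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(10,)),               # input layer: 10 features (assumed)
    layers.Dense(16, activation="relu"),     # hidden layer: each neuron sees all inputs
    layers.Dense(8, activation="relu"),      # second hidden layer
    layers.Dense(1, activation="sigmoid"),   # output: probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=20, batch_size=32)  # hypothetical training data
```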
Key Points
- Benefits of Neural Networks:
- Can fit nonlinear datasets effectively.
- Automatically generates new feature combinations.
- Highly scalable and adaptable for various complex applications.
- Challenges:
- Complexity in tuning and risk of overfitting.
- Requires extensive computation, especially for large networks.
Labs
- Regression with SKLearn Neural Network (Lesson 213)
- Regression with Keras and TensorFlow (Lesson 214)
- Binary Classification - Customer Churn Prediction (Lessons 216 & 217)
- Multiclass Classification - Iris (Lesson 218)
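Not the labs' actual code, but a sketch of the kind of model the SKLearn regression lab builds, using a synthetic dataset:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic regression data standing in for the lab's dataset.
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))
```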
Convolutional Neural Network (CNN) (Lesson 230)
How CNNs Work
- Convolution Operation:
- CNNs break down images into smaller squares (patches) using a sliding window.
- For instance, a 4x4 filter slides across the image, capturing each 4x4 patch.
- Each neuron receives a patch rather than an individual pixel, preserving spatial context.
- Feature Learning:
- Neurons in CNNs learn to differentiate between different classes of image features (e.g., cars vs. faces).
- They identify dominant characteristics specific to each image class.
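A minimal Keras sketch of the convolution idea above; each Conv2D filter slides a small window over the image, so every output value summarizes a patch rather than a single pixel. The input shape, filter counts, and two-class output are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),                      # 64x64 RGB images (assumed)
    layers.Conv2D(16, kernel_size=4, activation="relu"),  # 4x4 patches, as described above
    layers.MaxPooling2D(2),                               # downsample, keeping dominant features
    layers.Conv2D(32, kernel_size=4, activation="relu"),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),                # e.g., cars vs. faces
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```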
Advantages of CNNs
- Preservation of Spatial Relationships: By analyzing patches rather than individual pixels, CNNs maintain the spatial hierarchy of pixels, crucial for understanding image content.
- Efficiency: For image classification, CNN models are generally smaller and more efficient than deep, general-purpose neural networks.
- Improved Performance: CNNs typically outperform traditional networks in image-related tasks due to their ability to capture and learn from spatial information in images.
Reference
- MIT 6.S191 Lecture: “Convolutional Neural Networks” by Ava Soleimany.
Recurrent Neural Networks (RNN), LSTM (Lesson 231)
Key Characteristics of RNNs
- Sequential Processing: Unlike general-purpose neural networks that process single inputs, RNNs handle sequences of inputs (e.g., series of words, stock prices over time).
- Memory Mechanism: RNNs maintain an internal state to remember past information, crucial for sequential decision-making.
- Feedback Loops: These loops allow RNNs to update and maintain their internal state based on new inputs and previously learned information.
LSTM Networks
- Long Short-Term Memory: LSTMs are a special kind of RNN capable of learning long-term dependencies.
- Selective Memory: They excel in remembering important past information and forgetting irrelevant details, making them effective for complex sequential tasks.
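A minimal Keras sketch of an LSTM for a sequence task such as next-value prediction; the window length and layer size are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(30, 1)),      # 30 time steps, 1 feature per step (assumed)
    layers.LSTM(32),                  # internal state carries information across steps
    layers.Dense(1),                  # predict the next value in the sequence
])
model.compile(optimizer="adam", loss="mse")
model.summary()
# model.fit(X_windows, y_next, epochs=20)  # hypothetical windowed time-series data
```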
Reference
- MIT 6.S191 Lecture: “Recurrent Neural Networks” by Ava Soleimany.
Generative Adversarial Networks (GANs) (Lesson 232)
Core Components
- Two-Player Game Setup: GANs consist of two key players – the Discriminator and the Generator.
- Discriminator Network: Trained to distinguish real images from fake ones, assigning high probability to real images.
- Generator Network: Produces synthetic data, such as fake images.
- Learning Process: The discriminator learns to assign low probabilities to these fake images.
Game Dynamics
- Concurrent Optimization: The generator tries to create images that the discriminator will perceive as real, while the discriminator keeps improving at telling real from fake.
- Stable State Goal: Achieving a state where the generator produces perfectly realistic images indistinguishable from actual data.
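A compact sketch of the two-player game on 1-D toy data (instead of images, to keep it short); all model sizes and hyperparameters are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Generator: maps 4-dimensional noise to a single fake value.
generator = models.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),
])
# Discriminator: outputs the probability that its input is real.
discriminator = models.Sequential([
    layers.Input(shape=(1,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

g_opt = tf.keras.optimizers.Adam(1e-3)
d_opt = tf.keras.optimizers.Adam(1e-3)
bce = tf.keras.losses.BinaryCrossentropy()

# "Real" data: samples from a distribution the generator must learn to imitate.
real_data = np.random.normal(loc=3.0, scale=0.5, size=(1024, 1)).astype("float32")

for step in range(500):
    noise = tf.random.normal((64, 4))
    real = real_data[np.random.randint(0, 1024, size=64)]
    fake = generator(noise, training=False)

    # Train the discriminator: high probability for real samples, low for fakes.
    with tf.GradientTape() as tape:
        d_loss = (bce(tf.ones((64, 1)), discriminator(real, training=True)) +
                  bce(tf.zeros((64, 1)), discriminator(fake, training=True)))
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # Train the generator: try to make the discriminator label its fakes as real.
    with tf.GradientTape() as tape:
        fake = generator(noise, training=True)
        g_loss = bce(tf.ones((64, 1)), discriminator(fake, training=True))
    g_grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
```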
Applications
- Synthetic Data Generation: Creating realistic synthetic images for training other models.
- Diverse Object Creation: Capable of producing a wide array of objects.
- Practical Use Case: Apple's use of GANs to merge text sources with smaller trajectory datasets, creating new trajectories to expand their dataset.
Reference