Deep Learning and Neural Networks
Regression - Gradient Descent (Batch, Mini-Batch, Stochastic), Loss, RMSProp, Adam (Lesson 209)
Objective
- Explore how the linear regression algorithm trains a model and understand the role of gradient descent and loss functions.
Concepts
- Linear Regression:
- Basic algorithm for predictive modeling.
- Involves finding the best weights for input features to predict an output.
- Loss Function (Mean Squared Error):
- Measures the difference between predicted values and actual values.
- Lower loss indicates better predictions.
- Gradient Descent:
- A systematic, iterative approach to finding the optimal weights.
- Uses the gradient of the loss function to adjust weights.
- Moves weights in the direction that reduces the loss.
- Learning Rate:
- Controls the magnitude of weight adjustments.
- Low learning rate: Many small steps to reach the optimal weight.
- High learning rate: May overshoot the optimal weight.
- Variants of Gradient Descent (see the sketch after this list):
- Batch Gradient Descent: Adjusts weights using all training examples in each iteration.
- Stochastic Gradient Descent (SGD): Adjusts weights based on each training example.
- Mini-Batch Gradient Descent: Combines batch and SGD, adjusting weights using small subsets of the training data.
- Optimization Techniques:
- Adaptive learning rate algorithms (e.g., RMSProp, Adagrad, Adam) to improve convergence.
- Momentum to reduce oscillations and speed up convergence.
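A minimal NumPy sketch of the ideas above (function and parameter names are illustrative, not from the lesson). The three variants differ only in how many training examples feed each weight update: all of them (batch), one (stochastic), or a small subset (mini-batch).

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=100, batch_size=None):
    """Linear regression via gradient descent on mean squared error.

    batch_size=None -> batch gradient descent (all examples per step)
    batch_size=1    -> stochastic gradient descent (one example per step)
    batch_size=k    -> mini-batch gradient descent (k examples per step)
    """
    n, d = X.shape
    w = np.random.randn(d)               # start with random weights
    batch_size = batch_size or n
    for _ in range(epochs):
        idx = np.random.permutation(n)   # shuffle examples each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            error = Xb @ w - yb
            grad = 2 * Xb.T @ error / len(batch)  # gradient of the MSE loss
            w -= lr * grad                        # step against the gradient
    return w
```

In practice, libraries wrap this loop behind an optimizer object; in Keras, for instance, the fixed learning rate above would be handled by an adaptive optimizer such as tf.keras.optimizers.RMSprop or tf.keras.optimizers.Adam.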
Process
- Training Linear Regression:
- Initialize random weights.
- Calculate loss using loss function.
- Adjust weights based on the gradient of the loss function.
- Gradient Descent in Action:
- Plot loss versus weight.
- Start with a random weight.
- Move weight in the direction that reduces the loss (opposite of the gradient).
- Continue until reaching an optimal weight.
- Dealing with Multiple Features:
- The loss surface becomes multidimensional.
- Adjust the weights of all features simultaneously, as in the update rule below.
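With multiple features, the same update applies in vector form. A compact statement of the loss and the step, where X is the feature matrix, y the targets, w the weight vector, and η the learning rate:

```latex
L(w) = \tfrac{1}{n}\lVert Xw - y\rVert^2, \qquad
\nabla_w L = \tfrac{2}{n}\, X^\top (Xw - y), \qquad
w \leftarrow w - \eta\, \nabla_w L
```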
Classification - Gradient Descent, Loss Function (Lesson 210)
Objective
- Explore logistic regression as a classification algorithm and understand its functioning in predicting probabilities.
Concepts
- Logistic Regression Overview:
- A classification algorithm, despite the name suggesting regression.
- Similar to linear regression, but predicts probabilities (0 to 1) using a sigmoid function.
- Sigmoid Function (see the sketch after this list):
- Key component in logistic regression.
- Converts any input to a value between 0 and 1, ideal for probability predictions.
- Model Training:
- Input features (X) with corresponding weights (W).
- Model predicts the probability of belonging to a positive class.
- Cutoff generally at 0.5 for classifying into positive or negative classes.
- Logistic Loss Function:
- Measures the quality of predictions.
- Composed of two parts: one for positive and one for negative samples.
- Loss is high for misclassifications and low for accurate predictions.
- Gradient Descent Optimization:
- Used to find the optimal weights that minimize the logistic loss.
- Process involves adjusting weights based on the loss gradient.
- Produces a bowl-shaped (convex) loss curve from which the gradient and optimal weights are determined.
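A minimal NumPy sketch of the two pieces named above (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real input into (0, 1) -- usable as a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y_true, p):
    """Logistic (log) loss: one term for positive samples, one for negative.

    High when the model is confidently wrong, low when it is right.
    """
    eps = 1e-12                       # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```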
Process
- Applying Sigmoid Function:
- Use linear model output as input to sigmoid function.
- Predicts the probability of the sample belonging to the positive class.
- Setting Cutoff for Classification:
- Default cutoff is 0.5.
- Adjusting weights changes the criteria for classification.
- Computing Logistic Loss:
- Calculate logistic loss for a range of predicted probabilities.
- Compare against actual labels to evaluate the loss.
- Weight Optimization with Gradient Descent:
- Start with random weights.
- Adjust weights iteratively to minimize logistic loss.
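A compact NumPy sketch of that loop (names and hyperparameters are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by gradient descent on the logistic loss."""
    n, d = X.shape
    w = np.random.randn(d)        # start with random weights
    for _ in range(epochs):
        p = sigmoid(X @ w)        # predicted probabilities for each sample
        grad = X.T @ (p - y) / n  # gradient of the mean logistic loss
        w -= lr * grad            # adjust weights to reduce the loss
    return w
```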
Neural Networks and Deep Learning (Lesson 211)
Objective
- Understand the structure and functioning of neural networks in deep learning.
Neural Network Structure
- Basic Architecture:
- Comprises an input layer, hidden layers, and an output layer.
- Resembles logistic regression but extends it with multiple neurons in hidden layers.
- Neurons and Activation Functions:
- Neurons generate new features by blending existing features with different weights.
- Activation functions introduce non-linearity, improving handling of complex datasets.
- Common Activation Functions:
- Sigmoid: Converts input to a range between 0 and 1.
- Tanh (Hyperbolic Tangent): Output ranges from -1 to 1.
- ReLU (Rectified Linear Unit): Outputs 0 for negative inputs and the raw input for positive inputs.
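A quick NumPy sketch of the three activations side by side:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # range (0, 1)

def tanh(z):
    return np.tanh(z)                 # range (-1, 1)

def relu(z):
    return np.maximum(0, z)           # 0 for negatives, identity for positives

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```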
Network Types and Applications
- General-Purpose Networks:
- Fully connected; each neuron in one layer is connected to every neuron in the next layer (see the Keras sketch after this list).
- Useful for diverse applications but may lead to overfitting.
- Convolutional Neural Networks (CNNs):
- Specialized for image and video analysis.
- Focuses on patterns around each pixel, not just the pixel itself.
- Recurrent Neural Networks (RNNs):
- Ideal for time series forecasting and natural language processing.
- Capable of remembering historical data, crucial for sequence-dependent predictions.
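A minimal fully connected network in Keras, as referenced above (the layer sizes, input width, and binary-classification setup are assumptions for illustration):

```python
import tensorflow as tf

# Input layer -> two hidden layers -> output layer; Dense layers are
# fully connected: every neuron links to all neurons in the next layer.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),              # 20 input features
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(8, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```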
Key Points
- Benefits of Neural Networks:
- Can fit nonlinear datasets effectively.
- Automatically generates new feature combinations.
- Highly scalable and adaptable for various complex applications.
- Challenges:
- Complexity in tuning and risk of overfitting.
- Requires extensive computation, especially for large networks.
Labs
- Regression with SKLearn Neural Network (Lesson 213)
- Regression with Keras and TensorFlow (Lesson 214)
- Binary Classification - Customer Churn Prediction (Lessons 216 & 217)
- Multiclass Classification - Iris (Lesson 218)
Convolutional Neural Network (CNN) (Lesson 230)
How CNNs Work
- Convolution Operation:
- CNNs break down images into smaller squares (patches) using a sliding window.
- For instance, a 4x4 filter slides across the image, capturing each 4x4 patch.
- Each neuron receives a patch rather than an individual pixel, preserving spatial context.
- Feature Learning:
- Neurons in CNNs learn to differentiate between different classes of image features (e.g., cars vs. faces).
- They identify dominant characteristics specific to each image class.
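A minimal Keras sketch of the idea: the convolutional layer slides 4x4 filters over the image so each unit sees a patch, not a single pixel (the input size and layer choices are illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),                      # small RGB images
    tf.keras.layers.Conv2D(16, kernel_size=4, activation="relu"),  # 4x4 filters over patches
    tf.keras.layers.MaxPooling2D(),                                # downsample feature maps
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),               # e.g., 10 image classes
])
```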
Advantages of CNNs
- Preservation of Spatial Relationships: By analyzing patches rather than individual pixels, CNNs maintain the spatial hierarchy of pixels, crucial for understanding image content.
- Efficiency: CNN models are generally smaller and more efficient compared to deep, general-purpose neural networks for image classification.
- Improved Performance: CNNs typically outperform traditional networks in image-related tasks due to their ability to capture and learn from spatial information in images.
Reference
- MIT 6.S191 Lecture: "Convolutional Neural Networks" by Ava Soleimany.
- Video Link: MIT 6.S191: Convolutional Neural Networks
Recurrent Neural Networks (RNN), LSTM (Lesson 231)
Key Characteristics of RNNs
- Sequential Processing: Unlike general-purpose neural networks that process single inputs, RNNs handle sequences of inputs (e.g., series of words, stock prices over time).
- Memory Mechanism: RNNs maintain an internal state to remember past information, crucial for sequential decision-making.
- Feedback Loops: These loops allow RNNs to update and maintain their internal state based on new inputs and previously learned information.
LSTM Networks
- Long Short-Term Memory: LSTMs are a special kind of RNN capable of learning long-term dependencies.
- Selective Memory: They excel in remembering important past information and forgetting irrelevant details, making them effective for complex sequential tasks.
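A minimal Keras LSTM for a sequence task such as time series forecasting (the sequence length, feature count, and layer sizes are assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 1)),  # 30 time steps, 1 feature per step
    tf.keras.layers.LSTM(32),              # internal state carries past information
    tf.keras.layers.Dense(1),              # predict the next value
])
model.compile(optimizer="adam", loss="mse")
```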
Reference
- MIT 6.S191 Lecture: "Recurrent Neural Networks" by Ava Soleimany.
- Video Link: MIT 6.S191: Recurrent Neural Networks
Generative Adversarial Networks (GANs) (Lesson 232)
Core Components
- Two-Player Game Setup: GANs consist of two key players – the Discriminator and the Generator.
- Discriminator Network: Trained to distinguish real images from fakes, assigning high probability to real ones.
- Generator Network: Produces synthetic data, such as fake images.
- Learning Process: The discriminator learns to assign low probabilities to these fake images.
Game Dynamics
- Concurrent Optimization: The generator tries to create images that the discriminator will perceive as real, while the discriminator simultaneously gets better at spotting fakes.
- Stable State Goal: Achieving a state where the generator's images are indistinguishable from actual data.
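A compact TensorFlow sketch of one adversarial training step under the two-player setup described above (network shapes, the flat 28x28 image format, and all hyperparameters are illustrative assumptions):

```python
import tensorflow as tf

latent_dim = 64  # assumed size of the generator's random input

# Generator: maps random noise to a flat 28*28 "image" vector.
generator = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(28 * 28, activation="sigmoid"),
])

# Discriminator: outputs the probability that its input is real.
discriminator = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28 * 28,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(real_images):
    noise = tf.random.normal([tf.shape(real_images)[0], latent_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_pred = discriminator(real_images, training=True)
        fake_pred = discriminator(fake_images, training=True)
        # Discriminator: high probability for real, low for fake.
        d_loss = bce(tf.ones_like(real_pred), real_pred) + \
                 bce(tf.zeros_like(fake_pred), fake_pred)
        # Generator: make the discriminator call its fakes real.
        g_loss = bce(tf.ones_like(fake_pred), fake_pred)
    d_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, discriminator.trainable_variables),
        discriminator.trainable_variables))
    g_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, generator.trainable_variables),
        generator.trainable_variables))
```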
Applications
- Synthetic Data Generation: Creating realistic synthetic images for training other models.
- Diverse Object Creation: Capable of producing a wide array of objects.
- Practical Use Case: Apple's use of GANs to merge text sources with smaller trajectory datasets, creating new trajectories to expand their dataset.
Reference
- Presentation: "GANs for Good- A Virtual Expert Panel" by Ian Goodfellow.
- Video Link: GANs for Good - DeepLearning.AI