If our regression model is not a perfect fit for our training data (and this is mostly the case), we might want to know how confident our model is about the predicted value.
Generating a confidence interval for the prediction can help us to understand this, because we can use it to calculate a confidence score for the prediction.
In this post, I will explore possible options to generate a confidence interval for regression models.
We can use residuals in our dataset to estimate a confidence interval.
It gives us a statistically grounded range, at a confidence threshold we choose, for any sample.
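For example, a minimal sketch of this idea on synthetic data (the model choice and variable names are only illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=500)

# Hold out part of the data so the residuals are not biased by the fit
X_train, X_valid = X[:400], X[400:]
y_train, y_valid = y[:400], y[400:]

model = LinearRegression().fit(X_train, y_train)
residuals = np.abs(y_valid - model.predict(X_valid))

alpha = 0.05
bound = np.percentile(residuals, 100 * (1 - alpha))  # 95% of |residuals| fall below this

# Interval for any new sample: prediction +/- bound
pred = model.predict(X_valid[:1])[0]
print(pred - bound, pred + bound)
```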
Choosing the 95th percentile of the absolute residuals (alpha = 0.05) means that 95% of the samples in our dataset have a residual lower than this value. However, although this method is very simple and doesn't add much computational complexity, there are some considerations:
If we can assume our data is normally distributed, then we can train a model with Maximum Likelihood Estimation (MLE), which uses negative log likelihood as the loss function.
So in mathematical terms,
\[-\log p(y \mid \mu, \sigma) = -\left[-\log\sqrt{2\pi} - \log\sigma - \frac{(y - \mu)^2}{2\sigma^2}\right] = \log\sigma + \frac{(y - \mu)^2}{2\sigma^2} + \frac{1}{2}\log 2\pi\]
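A minimal sketch of a network trained with this loss, assuming we predict both the mean and the log standard deviation (the architecture and names are illustrative):

```python
import torch
import torch.nn as nn

class MeanStdNet(nn.Module):
    """Small MLP that outputs both the mean and the log standard deviation."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, 1)
        self.log_sigma_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), self.log_sigma_head(h)

def gaussian_nll(mu, log_sigma, y):
    """Negative log likelihood; the constant (1/2) log(2*pi) is dropped."""
    sigma = torch.exp(log_sigma)
    return (log_sigma + (y - mu) ** 2 / (2 * sigma ** 2)).mean()
```

At inference time, the predicted mean and standard deviation directly give an interval, e.g. \(\mu \pm 1.96\sigma\) for a 95% confidence interval under the normality assumption.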
Ensemble methods are empirical techniques (appendix-1) where we train multiple different models, each of which produces a slightly different output for the same sample.
Then we can use the variance in the outputs to create a confidence interval, in a similar way to the first method.
So if we want to generate 95% confidence interval (alpha=0.05), we sort the predictions ascending, and our lower bound is at the 2.5th percentile, upper bound is at the 97.5th percentile.
How do we train multiple different models?
Bootstrapping
Train multiple models by resampling the training data with replacement, i.e. creating bootstrapped datasets by drawing samples from the original dataset with replacement.
So each bootstrapped dataset will contain some duplicated samples, and will be missing some of the original samples entirely.
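A minimal sketch of a bootstrapped ensemble with scikit-learn-style regressors (the base model here is just an example):

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor

def bootstrap_ensemble(X, y, n_models=50, base_model=None, seed=0):
    """Train n_models copies of base_model, each on a bootstrapped resample of (X, y)."""
    rng = np.random.default_rng(seed)
    base_model = base_model if base_model is not None else GradientBoostingRegressor()
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # sample row indices with replacement
        models.append(clone(base_model).fit(X[idx], y[idx]))
    return models

def percentile_interval(models, x, alpha=0.05):
    """Confidence interval from the spread of the ensemble's predictions for one sample."""
    preds = np.array([m.predict(x.reshape(1, -1))[0] for m in models])
    return np.percentile(preds, 100 * alpha / 2), np.percentile(preds, 100 * (1 - alpha / 2))
```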
Considerations with the ensemble methods:
This might seem strange at first glance, but if we follow a specific method, it's possible to train a secondary model that predicts the variance (MAE or RMSE) of the first model's predictions. Then we can use this value to calculate a confidence score for each sample.
Why not use the same train data in both models?
Because the first model is already fit to the training data, its errors there will be artificially low, which would bias the second model.
Isn't predicting the output itself and predicting the error (MAE(label - output)) the same thing? What extra information can this 2nd model learn?
This would be a valid concern if we tried to learn the signed errors themselves; here we learn the absolute error (MAE) instead, which is closer to learning the variance rather than the mean (in the Gaussian scenario). So we don't train the 2nd model on the same information.
After training this 2nd model, we can use various formulas (based on our need), to convert this value into a confidence score. A simple formula would be to use the sigmoid:
\[confidence = \frac{1}{1 + \exp\left(a - b \cdot \frac{\text{main model prediction}}{\text{max} - \text{min}}\right)}\]

where max and min bound the main model's output range (they normalize the value inside the sigmoid), and a, b are constants we tune for our use case.
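A minimal sketch of this approach, feeding the error model's prediction into a sigmoid of the same form; the error model type, the holdout split, and the constants a, b, max and min are placeholder choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_error_model(main_model, X_holdout, y_holdout):
    """Train a 2nd model on data the main model has NOT seen, targeting |error|."""
    abs_error = np.abs(y_holdout - main_model.predict(X_holdout))
    return RandomForestRegressor().fit(X_holdout, abs_error)

def confidence_score(error_model, x, lo, hi, a=0.0, b=-10.0):
    """Squash the normalized predicted error through the sigmoid above.
    lo and hi define the normalization range; with b < 0, larger predicted
    errors give lower confidence."""
    predicted_error = error_model.predict(x.reshape(1, -1))[0]
    return 1.0 / (1.0 + np.exp(a - b * (predicted_error / (hi - lo))))
```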
In a normal neural network, weights are fixed values, while in a BNN, weights are modeled as probability distributions (usually Gaussian distribution where we try to learn mean and standard deviation).
This means there is no single deterministic output from a BNN; instead, we sample from the weight distributions to compute the output.
So output of the model is also a distribution, which can be used to compute a confidence interval.
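As a toy illustration (not a trained BNN), sampling weights from Gaussian distributions and collecting the resulting outputs:

```python
import torch

# Toy single-layer model: each weight has a (pretend-learned) mean and std
w_mu, w_sigma = torch.zeros(3), torch.full((3,), 0.1)
b_mu, b_sigma = torch.tensor(0.0), torch.tensor(0.05)

def sample_predictions(x, n_samples=1000):
    """Sample weights from their distributions and collect the resulting outputs."""
    preds = []
    for _ in range(n_samples):
        w = torch.normal(w_mu, w_sigma)
        b = torch.normal(b_mu, b_sigma)
        preds.append((x @ w + b).item())
    return torch.tensor(preds)  # a distribution of outputs, not a single value

preds = sample_predictions(torch.tensor([1.0, 2.0, 3.0]))
lower, upper = preds.quantile(0.025), preds.quantile(0.975)  # 95% interval
```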
Considerations with this approach:
Training a true BNN is generally infeasible, as it requires learning the full posterior distribution over the weights. Why?
So we want to learn
\[p(w \mid D) = \frac{p(D \mid w) \cdot p(w)}{p(D)}\]
where:
- $p(w \mid D)$ is the posterior distribution over the weights $w$ given the data $D$,
- $p(D \mid w)$ is the likelihood of the data given weights $w$,
- $p(w)$ is the prior over the weights,
- $p(D)$ is the evidence, $p(D) = \int p(D \mid w)\, p(w)\, dw$.

Computing $p(D)$ requires an integral over a very high-dimensional weight space, which is not feasible with practical computational resources.
Can we estimate/approximate the posterior? Yes, there are 2 basic methods:
Variational Inference:
Instead of $p(w \mid D)$, introduce a simpler, parameterized distribution $q(w \mid \theta)$, which will approximate the posterior.
Markov Chain Monte Carlo (MCMC):
Construct a Markov chain that has $p(w \mid D)$ as its stationary distribution.
Monte Carlo Dropout as a Practical Workaround
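The idea is to keep dropout active at inference time and run several stochastic forward passes; the spread of the outputs then serves as the uncertainty estimate. A minimal sketch (architecture and names are illustrative):

```python
import torch
import torch.nn as nn

class DropoutRegressor(nn.Module):
    def __init__(self, in_dim=3, hidden=64, p=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model, x, n_samples=100):
    """Run repeated stochastic forward passes with dropout left on."""
    model.train()  # train mode keeps the Dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)  # mean prediction and its spread

model = DropoutRegressor()
mean, std = mc_dropout_predict(model, torch.randn(5, 3))
```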
Below is a summary table comparing certain aspects of each method, to make the decision process easier.
| Method | Approach Type | Computational Complexity | Assumptions |
|---|---|---|---|
| 1 - Naive Statistical Method | Empirical | Very Low | Residual distribution is representative |
| 2 - Mean/Standard Deviation Output (MLE Loss) | Parametric | Low | Outputs are normally distributed |
| 3 - Ensemble Methods | Empirical | High | - |
| 4 - Training a 2nd Model to Predict Variance | Empirical | Moderate | - |
| 5 - Bayesian Neural Networks (BNN) | Parametric | Very High (or Moderate if using MC Dropout) | Model weights have a known prior (e.g. Gaussian) |
| Method | Approach Type | Explanation |
|---|---|---|
| 1 - Naive Method | Empirical | Calculates uncertainty directly from the observed residuals of the training data |
| 2 - Mean/Standard Deviation Output (MLE Loss) | Parametric | Assumes a specific form (typically a Gaussian distribution) for the prediction error |
| 3 - Ensemble Methods | Empirical | Variability among the predictions of multiple independently trained models is used directly to estimate uncertainty |
| 4 - Training a 2nd Model to Predict Variance | Empirical | Trains a separate model to predict the variance of the first model’s predictions |
| 5 - Bayesian Neural Networks | Parametric | Places prior distributions on the model weights and computes a posterior distribution based on a chosen parametric family |