Confidence Intervals for Regression Models

A regression model is rarely a perfect fit for its training data, so we often want to know how confident the model is about a predicted value.

Generating a confidence interval for the prediction can help us to understand this, because we can use it to calculate a confidence score for the prediction.

In this post, I will explore possible options to generate a confidence interval for regression models.

Methods

1. Statistical (Naive) Method

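We can use the residuals in our dataset (the differences between the labels and the predictions) to estimate a confidence interval.

This gives us an empirically calibrated range for any sample, where we choose the confidence level ourselves.

For example, for 95% coverage (alpha=0.05) we take the 95th percentile of the absolute residuals: 95% of the samples in our dataset have an absolute residual at or below this value, so we can use it as the half-width of the interval around each prediction.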

However, although this method is very simple and doesn't add much computational complexity, there are some considerations:

- The interval doesn't depend on the input features or the weights of the model, so it has the same width for every sample. We can compute separate quantiles for different ranges of the data, but this requires enough samples in each range.
- If we calculate the residuals on the training data, the estimate may be over-optimistic, because the model has already seen these samples and their residuals are probably lower than in real-world usage. If we calculate them on the test data instead, we need to make sure that we have enough samples.
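
As a rough illustration of the residual-quantile idea, here is a minimal NumPy sketch. The held-out labels and predictions are synthetic, and the function name `residual_interval` is only an illustrative choice, not part of any library.

```python
import numpy as np

def residual_interval(y_true, y_pred, alpha=0.05):
    """Half-width q such that |residual| <= q for about (1 - alpha) of the samples."""
    abs_residuals = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.quantile(abs_residuals, 1.0 - alpha)

# Synthetic held-out data: a hypothetical model that is off by Gaussian noise.
rng = np.random.default_rng(0)
y_holdout = rng.normal(loc=10.0, scale=2.0, size=1000)
pred_holdout = y_holdout + rng.normal(scale=1.0, size=1000)

q = residual_interval(y_holdout, pred_holdout, alpha=0.05)

# The same half-width is applied to every new prediction.
new_prediction = 12.3
lower, upper = new_prediction - q, new_prediction + q

# Fraction of held-out samples whose absolute residual is within q.
coverage = np.mean(np.abs(y_holdout - pred_holdout) <= q)
```

Because `q` is a single number, the interval has the same width for every input, which is exactly the first limitation listed above.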

2. Generate both Mean & Standard Deviation as Output

If we can assume the target is normally distributed around the prediction, we can train a model that outputs both a mean and a standard deviation with Maximum Likelihood Estimation (MLE), using the negative log-likelihood as the loss function.

So in mathematical terms,

assuming normally distributed data

\[y \sim N(μ, σ²)\]

probability density function is

\[p(y | μ, σ) = (1 / (√(2π) σ)) · exp( – (y – μ)² / (2σ²) )\]

loss can be calculated using

\[\begin{aligned}
– log(p(y | μ, σ)) &= – [ – log(√(2π)) – log(σ) – (y – μ)²/(2σ²) ] \\
&= log(σ) + (y – μ)²/(2σ²) + (1/2) log(2π)
\end{aligned}\]

where $μ$ is the mean and $σ$ is the standard deviation, which are generated by the model; and $y$ is the actual label.
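
The loss above translates almost directly into code. The sketch below is plain Python for a single sample; the function names are illustrative, and the `z=1.96` multiplier for a roughly 95% interval is an assumption that only holds under the Gaussian model.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2): the last line of the derivation."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.log(sigma) + (y - mu) ** 2 / (2 * sigma ** 2) + 0.5 * math.log(2 * math.pi)

def interval_from_gaussian(mu, sigma, z=1.96):
    """Central interval mu +/- z * sigma; z = 1.96 gives about 95% for a Gaussian."""
    return mu - z * sigma, mu + z * sigma

# A confident but wrong prediction is penalised more than an uncertain one.
loss_confident = gaussian_nll(y=3.0, mu=0.0, sigma=0.5)
loss_uncertain = gaussian_nll(y=3.0, mu=0.0, sigma=3.0)
```

In a neural network, the same expression would be averaged over a batch, and the model would typically output something like log σ so that σ stays positive; that detail is left out here.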

3. Ensemble Methods

Ensemble methods are empirical techniques (appendix-1) where we train multiple different models, each of which produces a slightly different output for the same sample.

Then we can use the variance in the outputs to create a confidence interval, in a similar way to the 1st method.

So if we want to generate a 95% confidence interval (alpha=0.05), we sort the predictions in ascending order; the lower bound is the 2.5th percentile and the upper bound is the 97.5th percentile.
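
A minimal sketch of this percentile step, assuming the ensemble members' predictions for one sample are already collected in an array. The values here are synthetic and the function name is illustrative.

```python
import numpy as np

def ensemble_interval(predictions, alpha=0.05):
    """Lower and upper bounds taken from the percentiles of the ensemble outputs."""
    predictions = np.asarray(predictions, dtype=float)
    lower = np.percentile(predictions, 100 * alpha / 2)
    upper = np.percentile(predictions, 100 * (1 - alpha / 2))
    return lower, upper

# Hypothetical outputs of 200 ensemble members for a single input sample.
rng = np.random.default_rng(1)
member_predictions = rng.normal(loc=5.0, scale=0.3, size=200)

lower, upper = ensemble_interval(member_predictions, alpha=0.05)
inside = np.mean((member_predictions >= lower) & (member_predictions <= upper))
```

`np.percentile` sorts internally, so no explicit sort is needed.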

How do we train multiple different models?

Use different random initializations
Use different hyperparameters or architectures
Use different subsets of the data (bootstrap aggregation or bagging)

Bootstrapping

Train multiple models by resampling the training data with replacement: each bootstrapped dataset is built by drawing samples at random from the original dataset, and every drawn sample is put back before the next draw.
So each bootstrapped dataset will contain some duplicated samples and will miss some of the original samples entirely.
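
To make the resampling concrete, here is a small sketch that fits one straight line per bootstrapped dataset with `np.polyfit`. The data, the linear model and the number of models are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic training data for a hypothetical 1-D linear problem.
x_train = rng.uniform(0.0, 10.0, size=200)
y_train = 2.0 * x_train + 1.0 + rng.normal(scale=1.0, size=200)

def fit_bootstrap_ensemble(x, y, n_models=200):
    """Fit one line per bootstrapped dataset; each row is (slope, intercept)."""
    n = len(x)
    coefs = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # draw n indices with replacement
        coefs.append(np.polyfit(x[idx], y[idx], deg=1))
    return np.array(coefs)

coefs = fit_bootstrap_ensemble(x_train, y_train)

# Every member predicts at the same query point; the spread gives the interval.
x_query = 4.0
member_preds = coefs[:, 0] * x_query + coefs[:, 1]
lower, upper = np.percentile(member_preds, [2.5, 97.5])

# One resample for inspection: duplicates appear and some rows are missing.
idx = rng.integers(0, len(x_train), size=len(x_train))
unique_fraction = len(np.unique(idx)) / len(x_train)
```

On average, a bootstrapped dataset contains roughly 63% of the distinct original samples, which is what `unique_fraction` estimates. The resulting interval reflects how much the fitted models disagree, not the noise in the labels.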

Considerations with the ensemble methods:

- As we train many models (typically around 1000), generating a confidence interval is computationally much more expensive than with the first 2 methods.
- Although this method doesn't assume any underlying distribution for the data (which is an advantage when the distribution is unknown or non-Gaussian), the confidence intervals only represent what the models have learned. If the models didn't learn the data well, the intervals may not reflect the actual variance.

4. Training a 2nd Model to Predict Variance

This might seem strange at first glance, but if we follow a specific method, it's possible to train a secondary model that predicts the expected error (MAE or RMSE) of the first model, which serves as a proxy for its variance. Then we can use this predicted error to calculate a confidence score for each sample.

So let's say we have a regression model trained on our training data & a set of features.
We can use this model to predict outputs on a separate dataset (a validation set or any other held-out data).
Now we can use the same set of features & the first model's output as input to the 2nd model, and train it on this separate dataset to predict the MAE or RMSE of the first model.

Why not use the same training data in both models?

Because the first model is already fitted to the training data, its errors there are lower than on unseen data, which would bias the second model.

Aren't predicting the output itself and predicting the error (label - output) the same thing? What extra information can this 2nd model learn?

This would be a valid objection if we tried to learn the signed errors. Here we learn the absolute error instead, which is closer to learning the variance than the mean (in the Gaussian scenario). So the 2nd model is not trained on the same information.
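
The two-model setup can be sketched end to end with NumPy. Everything here is an assumption made for illustration: the synthetic data (noise grows with x), the quadratic main model, and the linear least-squares error model. Any regressors could take their place.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data (hypothetical): quadratic trend, noise that grows with x."""
    x = rng.uniform(0.0, 10.0, size=n)
    y = 0.5 * x**2 + rng.normal(scale=0.2 + 0.3 * x, size=n)
    return x, y

# 1) Main model: a quadratic fit on the training split.
x_train, y_train = make_data(500)
main_coefs = np.polyfit(x_train, y_train, deg=2)

# 2) Absolute errors of the main model on a separate split it has not seen.
x_hold, y_hold = make_data(500)
pred_hold = np.polyval(main_coefs, x_hold)
abs_err_hold = np.abs(y_hold - pred_hold)

def error_features(x, main_pred):
    """Inputs of the 2nd model: the feature, the main prediction, and a bias column."""
    return np.column_stack([x, main_pred, np.ones(len(x))])

# 3) Error model: least-squares regression onto the absolute errors.
error_w, *_ = np.linalg.lstsq(
    error_features(x_hold, pred_hold), abs_err_hold, rcond=None
)

def predict_with_error(x_new):
    """Return the main prediction and the predicted absolute error for each input."""
    x_new = np.asarray(x_new, dtype=float)
    main_pred = np.polyval(main_coefs, x_new)
    expected_abs_err = error_features(x_new, main_pred) @ error_w
    return main_pred, expected_abs_err

main_pred, expected_abs_err = predict_with_error([1.0, 9.0])
```

Since the noise in this toy data grows with x, the error model should predict a larger error at x = 9 than at x = 1, which is the per-sample behaviour the 1st method cannot provide.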

After training this 2nd model, we can use various formulas (based on our need), to convert this value into a confidence score. A simple formula would be to use the sigmoid:

\[confidence = 1 / (1 + exp(a - b * (main~model~prediction / (max - min))))\]

where a and b are tunable constants, max is

\[main~model~prediction + error~model~prediction\]

and min is

\[main~model~prediction - error~model~prediction\]
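
Written out as code, the formula looks like the sketch below. The default values of `a` and `b` are placeholders, since the formula leaves them as free constants, and the function name is illustrative.

```python
import math

def confidence_score(main_pred, error_pred, a=0.0, b=1.0):
    """Sigmoid confidence from the main prediction and the predicted error."""
    upper = main_pred + error_pred  # "max" in the formula
    lower = main_pred - error_pred  # "min" in the formula
    width = upper - lower           # equals 2 * error_pred
    if width <= 0:
        raise ValueError("error_pred must be positive")
    return 1.0 / (1.0 + math.exp(a - b * (main_pred / width)))

# Same prediction, different predicted errors.
narrow = confidence_score(main_pred=10.0, error_pred=1.0)
wide = confidence_score(main_pred=10.0, error_pred=5.0)
```

For a fixed positive prediction, a smaller predicted error gives a narrower interval and a score closer to 1. Note that the score also depends on the sign and scale of the prediction itself, so `a` and `b` would need to be tuned for the target's range.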

5. Bayesian Neural Networks (BNN)

In a normal neural network, weights are fixed values, while in a BNN, weights are modeled as probability distributions (usually Gaussian distributions, for which we learn the mean and standard deviation).

This means there is no single deterministic output from a BNN; we sample from the weight distributions to compute each output.

So the output of the model is also a distribution, which can be used to compute a confidence interval.

Considerations with this approach:

Training a true BNN is generally unfeasible, as it requires learning the full posterior distribution. Why?

So we want to learn

\[p(w | D) = [p(D | w) · p(w)] / p(D)\]

where:

- $p(D | w)$ is the likelihood of the data given weights $w$,
- $p(w)$ is the prior over weights, and
- $p(D)$ is a normalization factor (or evidence), which is an integral over all possible weights.

Computing $p(D)$ requires integrating over a very high-dimensional space, which is not feasible with practical computational resources.

Can we estimate/approximate the posterior? Yes, there are 2 basic methods:

Variational Inference:
- Instead of $p(w | D)$, introduce a simpler, parameterized distribution $q(w | θ)$, which will approximate the posterior.
- Adjust $θ$ to minimize the Kullback-Leibler (KL) divergence between the two distributions.

Markov Chain Monte Carlo (MCMC):
- Draw samples from the true posterior instead of approximating it with a simpler distribution.
- Construct a Markov chain that has $p(w | D)$ as its stationary distribution.
- After a “burn-in” period, we can treat subsequent samples as representative of the posterior.
- For large NNs, this method may be slow to converge.
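
To show the MCMC idea on something small, here is a sketch of a random-walk Metropolis sampler for a model with a single weight, y = w·x + noise. The data, the Gaussian prior, the noise level and the step size are all assumptions for illustration. A real BNN has far more weights, which is where the slow convergence comes from.

```python
import math
import random

random.seed(0)

# Synthetic data from y = 2x + noise; the single weight w is what we infer.
xs = [i / 10 for i in range(1, 31)]
ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]

def log_unnormalized_posterior(w, noise_std=0.5, prior_std=10.0):
    """log p(D | w) + log p(w) up to constants; p(D) is never computed."""
    log_lik = -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * noise_std ** 2)
    log_prior = -(w ** 2) / (2 * prior_std ** 2)
    return log_lik + log_prior

def metropolis(n_steps=5000, burn_in=1000, step=0.1, w0=0.0):
    """Random-walk Metropolis: returns the samples kept after the burn-in period."""
    w, logp = w0, log_unnormalized_posterior(w0)
    samples = []
    for i in range(n_steps):
        proposal = w + random.gauss(0.0, step)
        logp_new = log_unnormalized_posterior(proposal)
        # Accept with probability min(1, p_new / p_old); p(D) cancels in the ratio.
        if random.random() < math.exp(min(0.0, logp_new - logp)):
            w, logp = proposal, logp_new
        if i >= burn_in:
            samples.append(w)
    return samples

samples = metropolis()
posterior_mean = sum(samples) / len(samples)
ordered = sorted(samples)
lower = ordered[int(0.025 * len(ordered))]
upper = ordered[int(0.975 * len(ordered))]
```

The acceptance step only needs the ratio of two posterior values, so the intractable evidence $p(D)$ cancels out. That is why MCMC can sample from the posterior without ever computing the normalization factor.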

Monte Carlo Dropout as a Practical Workaround

Instead of the above two, we can use a completely different method to sample from an approximate posterior, as a workaround.
Dropout is originally a regularization technique that randomly drops units during training to prevent overfitting.
Gal and Ghahramani showed that if we leave dropout active during testing and perform multiple forward passes, we effectively obtain samples from an approximate posterior over the network’s weights. (references-1)
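
The mechanics of MC Dropout can be sketched with a tiny NumPy network. The weights below are random and untrained, purely to keep the example self-contained; in practice they would come from a network trained with dropout. The layer sizes, drop probability and number of passes are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, untrained weights for a 1-16-1 network.
W1 = rng.normal(size=(1, 16))
b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)) / 4.0
b2 = np.zeros(1)

def forward(x, drop_prob=0.2, dropout_active=True):
    """One forward pass; with dropout active, a fresh random mask is drawn per call."""
    hidden = np.maximum(0.0, x @ W1 + b1)             # ReLU hidden layer
    if dropout_active:
        keep = rng.random(hidden.shape) >= drop_prob  # drop each unit with prob. drop_prob
        hidden = hidden * keep / (1.0 - drop_prob)    # inverted-dropout rescaling
    return hidden @ W2 + b2

def mc_dropout_predict(x, n_passes=200, alpha=0.05):
    """Repeat stochastic forward passes and summarise the resulting outputs."""
    outputs = np.stack([forward(x) for _ in range(n_passes)])  # (n_passes, n_samples, 1)
    mean = outputs.mean(axis=0)
    lower = np.percentile(outputs, 100 * alpha / 2, axis=0)
    upper = np.percentile(outputs, 100 * (1 - alpha / 2), axis=0)
    return mean, lower, upper

x_query = np.array([[0.5], [2.0]])
mean, lower, upper = mc_dropout_predict(x_query)
```

Each forward pass uses a different dropout mask, so the passes act like samples from slightly different networks. Their percentiles are then used in the same way as in the ensemble method.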

Result

Below is a summary table comparing certain aspects of each method, to make the decision process easier.

Method	Approach Type	Computational Complexity	Assumptions
1 - Naive Statistical Method	Empirical	Very Low	Residual distribution is representative
2 - Mean/Standard Deviation Output (MLE Loss)	Parametric	Low	Outputs are normally distributed
3 - Ensemble Methods	Empirical	High	-
4 - Training a 2nd Model to Predict Variance	Empirical	Moderate	-
5 - Bayesian Neural Networks (BNN)	Parametric	Very High (or Moderate if using MC Dropout)	Model weights have a known prior (e.g. Gaussian)

References

1. Gal, Y. and Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

Appendices

1. Empirical vs Parametric

Parametric methods assume a specific form for the underlying data distribution (e.g. normal distribution), so we estimate its parameters from the data (e.g. $(μ, σ)$ for a normal distribution).
Empirical methods use the dataset directly to make their estimates (e.g. the Naive method, which uses observed residuals to build a confidence interval).

Method	Approach Type	Explanation
1 - Naive Method	Empirical	Calculates uncertainty directly from the observed residuals of the training data
2 - Mean/Standard Deviation Output (MLE Loss)	Parametric	Assumes a specific form (typically a Gaussian distribution) for the prediction error
3 - Ensemble Methods	Empirical	Variability among the predictions of multiple independent models is used directly to estimate uncertainty
4 - Training a 2nd Model to Predict Variance	Empirical	Trains a separate model to predict the variance of the first model's predictions
5 - Bayesian Neural Networks	Parametric	Places prior distributions on the model weights and computes a posterior distribution based on a chosen parametric family