Abdullah Şamil Güser

Confidence Intervals for Regression Models

If our regression model is not a perfect fit for our training data (which is usually the case), we might want to know how confident our model is about the predicted value.

Generating a confidence interval for the prediction can help us understand this, because we can use it to calculate a confidence score for each prediction.

In this post, I will explore possible options to generate a confidence interval for regression models.

Methods

1. Statistical (Naive) Method

We can use residuals in our dataset to estimate a confidence interval.

It gives us a statistically grounded range for any sample, with a confidence level that we choose ourselves.
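As a rough sketch of the idea (using scikit-learn and synthetic data; the variable names and model choice below are illustrative assumptions), we can take the empirical percentiles of the residuals on a held-out split and shift each new prediction by them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data just to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.8, size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Residuals on held-out data stand in for the prediction error distribution.
residuals = y_val - model.predict(X_val)

alpha = 0.05  # 95% confidence interval
lower_q, upper_q = np.percentile(residuals, [100 * alpha / 2, 100 * (1 - alpha / 2)])

preds = model.predict(X_val)
lower_bound = preds + lower_q   # lower_q is usually negative
upper_bound = preds + upper_q
```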

Although this method is very simple and doesn’t add much computational complexity, there are some considerations:

2. Generate both Mean & Standard Deviation as Output

If we can assume our data is normally distributed, then we can train a model with Maximum Likelihood Estimation (MLE), which uses negative log likelihood as the loss function.

So in mathematical terms,

\[y \sim \mathcal{N}(\mu, \sigma^2)\]

\[p(y \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)\]

Taking the negative log likelihood gives the per-sample loss:

\[-\log p(y \mid \mu, \sigma) = -\left[-\log\!\left(\sqrt{2\pi}\right) - \log(\sigma) - \frac{(y - \mu)^2}{2\sigma^2}\right] = \log(\sigma) + \frac{(y - \mu)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi)\]
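As a minimal sketch of this idea (assuming a small PyTorch network; the architecture, layer sizes, and training loop below are illustrative choices, not a prescribed setup), the model outputs a mean and a log standard deviation, and the negative log likelihood above becomes the training loss:

```python
import torch
import torch.nn as nn

class MeanStdRegressor(nn.Module):
    """Predicts a mean and a log standard deviation for each input."""
    def __init__(self, n_features):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.mu_head = nn.Linear(32, 1)
        self.log_sigma_head = nn.Linear(32, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.log_sigma_head(h)

def gaussian_nll(y, mu, log_sigma):
    # log(sigma) + (y - mu)^2 / (2 sigma^2); the constant (1/2) log(2*pi) is dropped.
    return (log_sigma + (y - mu) ** 2 / (2 * torch.exp(2 * log_sigma))).mean()

# Tiny training loop on synthetic data, just to show the loss in use.
torch.manual_seed(0)
X = torch.randn(256, 4)
y = X.sum(dim=1, keepdim=True) + 0.5 * torch.randn(256, 1)

model = MeanStdRegressor(n_features=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    mu, log_sigma = model(X)
    loss = gaussian_nll(y, mu, log_sigma)
    opt.zero_grad()
    loss.backward()
    opt.step()

# sigma = exp(log_sigma) gives a per-sample standard deviation for building the interval.
```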

3. Ensemble Methods

Ensemble methods are empirical techniques (appendix-1), where we train multiple different models, each of which produces a slightly different output for the same sample.

Then we can use the variance in the outputs to create a confidence interval, in a similar way to the 1st method.

So if we want to generate a 95% confidence interval (alpha = 0.05), we sort the predictions in ascending order; the lower bound is the 2.5th percentile and the upper bound is the 97.5th percentile.

How do we train multiple different models?

Bootstrapping
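With bootstrapping, we resample the training set with replacement and fit one model per resample, so each model sees a slightly different dataset. A minimal sketch (the base model, number of resamples, and synthetic data are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.8, size=500)

n_models, n = 100, len(X)
models = []
for _ in range(n_models):
    idx = rng.integers(0, n, size=n)            # sample rows with replacement
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Each model gives a slightly different prediction for the same sample.
all_preds = np.stack([m.predict(X[:5]) for m in models])   # shape (n_models, 5)

# 95% interval from the 2.5th and 97.5th percentiles of the ensemble predictions.
lower, upper = np.percentile(all_preds, [2.5, 97.5], axis=0)
```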

Considerations with the ensemble methods:

4. Training a 2nd Model to Predict Variance

This might seem strange at first glance, but if we follow a specific recipe, it’s possible to train a secondary model that predicts the error (MAE or RMSE) of the first model. Then we can use this error estimate to calculate a confidence score for each sample.

  1. Let’s say we have a regression model trained on our training data & a set of features.
  2. We can use this model to predict outputs on a separate dataset (the validation set or another held-out set).
  3. Now we can use the same set of features & the first model’s output as input to a 2nd model, and train it on this separate dataset to predict the MAE or RMSE of the first model.
  4. After training this 2nd model, we can use various formulas (based on our needs) to convert its output into a confidence score. A simple one is the sigmoid below (see the code sketch after the formula):

    \[\text{confidence} = \frac{1}{1 + \exp\!\left(a - b \cdot \frac{\text{main model prediction}}{\text{max} - \text{min}}\right)}\]

    where

    \[\text{max} = \text{main model prediction} + \text{error model prediction}\]

    \[\text{min} = \text{main model prediction} - \text{error model prediction}\]
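Here is a rough end-to-end sketch of this recipe (the choice of models, the absolute error as the target of the 2nd model, and the a, b constants are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X @ np.array([2.0, -1.0, 0.5, 1.0]) + rng.normal(scale=1.0, size=1000)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# 1. Main regression model trained on the training split.
main_model = LinearRegression().fit(X_train, y_train)

# 2-3. On a separate split, train a 2nd model to predict the main model's absolute error,
#      using the same features plus the main model's own prediction as an extra input.
hold_preds = main_model.predict(X_hold)
error_target = np.abs(y_hold - hold_preds)
error_features = np.column_stack([X_hold, hold_preds])
error_model = GradientBoostingRegressor().fit(error_features, error_target)

# 4. Convert the predicted error into a confidence score via the sigmoid formula above.
def confidence(x_row, a=0.0, b=1.0):
    pred = main_model.predict(x_row.reshape(1, -1))[0]
    err = error_model.predict(np.column_stack([x_row.reshape(1, -1), [[pred]]]))[0]
    upper, lower = pred + err, pred - err
    return 1.0 / (1.0 + np.exp(a - b * pred / (upper - lower + 1e-9)))

# Example: confidence(X_hold[0]) returns a score in (0, 1).
```

Feeding the main model’s own prediction to the error model (alongside the features) is one way to let it learn the regions of the input space where the first model tends to be wrong.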

5. Bayesian Neural Networks (BNN)

In a normal neural network, weights are fixed values, while in a BNN, weights are modeled as probability distributions (usually Gaussian distributions whose mean and standard deviation we try to learn).

This means there is no single deterministic output from a BNN; instead, we sample from the weight distributions to compute the output.

So the output of the model is also a distribution, which can be used to compute a confidence interval.

Considerations with this approach:

Training a true BNN is generally infeasible, as it requires learning the full posterior distribution over the weights. Why?

Can we estimate/approximate the posterior? Yes, there are two basic methods:

  1. Variational Inference:

    • Instead of $p(w \mid D)$, introduce a simpler, parameterized distribution $q(w \mid \theta)$, which will approximate the posterior.
    • Adjust $\theta$ to minimize the Kullback-Leibler (KL) divergence between the two distributions.
  2. Markov Chain Monte Carlo (MCMC):

    • Draw samples from the true posterior.
    • Construct a Markov chain that has $p(w \mid D)$ as its stationary distribution.
    • After a “burn-in” period, we can treat subsequent samples as representative of the posterior.
    • For large NNs, this method may be slow to converge as well.

Monte Carlo Dropout as a Practical Workaround
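Following the idea in reference 1, a practical approximation is to keep dropout active at prediction time and run multiple stochastic forward passes; the spread of the sampled outputs then serves as the uncertainty estimate. A minimal PyTorch sketch (the network, dropout rate, and number of passes are arbitrary choices):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

x = torch.randn(8, 4)  # a small batch of hypothetical inputs

# Keep dropout active at inference time so each forward pass is stochastic.
model.train()
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(100)])  # (100, 8, 1)

mean = samples.mean(dim=0)              # point prediction
lower = samples.quantile(0.025, dim=0)  # 95% interval bounds
upper = samples.quantile(0.975, dim=0)
```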

Result

Below is a summary table comparing certain aspects of each method, to make the decision process easier.

| Method | Approach Type | Computational Complexity | Assumptions |
|---|---|---|---|
| 1 - Naive Statistical Method | Empirical | Very Low | Residual distribution is representative |
| 2 - Mean/Standard Deviation Output (MLE Loss) | Parametric | Low | Outputs are normally distributed |
| 3 - Ensemble Methods | Empirical | High | - |
| 4 - Training a 2nd Model to Predict Variance | Empirical | Moderate | - |
| 5 - Bayesian Neural Networks (BNN) | Parametric | Very High (or Moderate if using MC Dropout) | Model weights have a known prior (e.g. Gaussian) |

References

  1. Gal, Y. & Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.

Appendices

1. Empirical vs Parametric

| Method | Approach Type | Explanation |
|---|---|---|
| 1 - Naive Method | Empirical | Calculates uncertainty directly from the observed residuals of the training data |
| 2 - Mean/Standard Deviation Output (MLE Loss) | Parametric | Assumes a specific form (typically a Gaussian distribution) for the prediction error |
| 3 - Ensemble Methods | Empirical | Variability among the predictions of multiple independent models is used directly to estimate uncertainty |
| 4 - Training a 2nd Model to Predict Variance | Empirical | Trains a separate model to predict the variance of the first model’s predictions |
| 5 - Bayesian Neural Networks | Parametric | Places prior distributions on the model weights and computes a posterior distribution based on a chosen parametric family |