\(\newcommand{\VS}{\quad \mathrm{VS} \quad}\) \(\newcommand{\and}{\quad \mathrm{and} \quad}\) \(\newcommand{\E}{\mathbb E}\) \(\newcommand{\P}{\mathbb P}\) \(\newcommand{\Var}{\mathbb V}\) \(\newcommand{\1}{\mathbf 1}\)
In practice, we often hesitate between several candidate models.
Several criteria exist; the main ones are the adjusted \(R^2\), the \(F\) test, Mallows' \(C_p\), AIC, and BIC.
Suppose we have \(p_{\max}\) explanatory variables, forming the “maximal” design matrix \(X_{\max}\).
True model (unknown):
\[Y = X^*\beta^* + \varepsilon\]
where \(X^*\) is a sub-matrix of \(X_{\max}\) formed by \(p^* \leq p_{\max}\) columns.
We don’t know \(p^*\) nor which variables are involved.
Goal: Select the correct matrix \(X^*\) and estimate \(\beta^*\).
We regress \(Y\) on \(p \leq p_{\max}\) variables, assuming: \[Y = X\beta + \varepsilon\] where \(X\): sub-matrix of \(X_{\max}\) containing the \(p\) chosen columns (yielding \(\hat{\beta}\)).
This model is potentially wrong (bad choice of variables).
Objective: Calculate a quality score for this submodel.
For a model with an intercept (constant term):
\[R_a^2 = 1 - \frac{n-1}{n-p} \cdot \frac{SSR}{SST}\]
Selection rule: Between two models, prefer highest \(R_a^2\).
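A minimal sketch of the rule, assuming a hypothetical data frame `dat` with response `y` and candidate predictors `x1`, `x2`, `x3` (none of these names come from the course): the adjusted \(R^2\) can be read directly from `summary()`.

```r
# hypothetical data: response y, candidate predictors x1, x2, x3
m_small <- lm(y ~ x1, data = dat)
m_large <- lm(y ~ x1 + x2 + x3, data = dat)

# prefer the model with the larger adjusted R^2
summary(m_small)$adj.r.squared
summary(m_large)$adj.r.squared
```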
\[F = \frac{n-p}{q} \cdot \frac{SSR_c - SSR}{SSR}\]
where \(SSR_c\) is the residual sum of squares of the constrained sub-model obtained by removing \(q\) variables from the model with \(p\) variables.
If \(F < f_{q,n-p}(1-\alpha)\): prefer the sub-model (\(H_0\) is not rejected at level \(\alpha\)).
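As a sketch (same hypothetical `dat` as above), `anova()` applied to two nested `lm` fits computes exactly this \(F\) statistic and its p-value.

```r
m_sub  <- lm(y ~ x1, data = dat)            # constrained sub-model (SSR_c)
m_full <- lm(y ~ x1 + x2 + x3, data = dat)  # larger model (SSR), so q = 2 here

# keep the sub-model when the p-value exceeds alpha,
# i.e. when F < f_{q,n-p}(1 - alpha)
anova(m_sub, m_full)
```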
Mallows’ \(C_p\) aims to estimate the prediction risk: \[\E(\|\tilde{Y} - X\hat{\beta}\|^2)\]
where \(\tilde{Y}\) follows the same distribution as \(Y\) but is independent.
\[C_p = \frac{SSR}{\hat{\sigma}^2} - n + 2p\]
Selection rule: Among all tested models, choose the one with lowest \(C_p\).
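A sketch computing \(C_p\) by hand, with \(\hat{\sigma}^2\) estimated (as is customary) from the maximal model containing all \(p_{\max}\) variables; `dat` is the same hypothetical data frame.

```r
m_max <- lm(y ~ ., data = dat)          # maximal model (all p_max variables)
m_sub <- lm(y ~ x1 + x2, data = dat)    # candidate sub-model

sigma2_hat <- summary(m_max)$sigma^2    # hat(sigma)^2 from the maximal model
SSR <- sum(residuals(m_sub)^2)
n   <- nrow(dat)
p   <- length(coef(m_sub))              # number of columns of X (intercept included)
SSR / sigma2_hat - n + 2 * p            # Mallows' C_p
```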
AIC (Akaike Information Criterion) is motivated in the same way as \(C_p\): it also targets the prediction error \(\tilde{Y} - X\hat{\beta}\), but measured with the Kullback–Leibler divergence instead of the quadratic distance.
\[AIC = n \ln\left(\frac{SSR}{n}\right) + 2(p+1)\]
Selection rule: choose model with lowest AIC.
In practice, AIC and \(C_p\) are very close and usually select the same model.
BIC (Bayesian Information Criterion) seeks the “most probable” model in a Bayesian formalism.
\[BIC = n \ln\left(\frac{SSR}{n}\right) + (p+1) \ln n\]
Selection rule: choose the one with lowest BIC.
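Both criteria are easy to compute from a fitted model; the sketch below follows the two formulas above. (R's built-in `AIC()` and `BIC()` differ from these only by an additive constant that does not depend on the model, so they yield the same ranking.)

```r
crit <- function(fit) {
  n   <- length(residuals(fit))
  SSR <- sum(residuals(fit)^2)
  p   <- length(coef(fit))              # number of columns of X
  c(AIC = n * log(SSR / n) + 2 * (p + 1),
    BIC = n * log(SSR / n) + (p + 1) * log(n))
}

crit(lm(y ~ x1, data = dat))            # same hypothetical `dat` as before
crit(lm(y ~ x1 + x2 + x3, data = dat))
```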
All previous criteria try to find a compromise between goodness of fit (a small \(SSR\)) and model complexity (the number of variables \(p\)).
This is a permanent trade-off in statistics (not just in regression).
\(C_p\), AIC, and BIC consist of minimizing an expression of the form:
\[f(SSR) + c(n) \cdot p\]
BIC: \(c(n) = \ln n\) \(\quad\quad\) AIC: \(c(n) = 2\)
(Other criteria exist built on the same principle.)
When \(\ln n > 2\) (i.e., as soon as \(n \geq 8\)), BIC penalizes large models more than AIC.
Ordering the criteria by the size of the models they tend to select, from the most parsimonious to the least:
\[BIC \leq F\text{ test} \leq C_p \approx AIC \leq R_a^2\]
| Probability as \(n \to \infty\) | BIC | \(C_p\), AIC, \(R_a^2\) |
|---|---|---|
| \(\mathbb{P}\)(selects model smaller than true) | \(\to 0\) | \(\to 0\) |
| \(\mathbb{P}\)(selects model larger than true) | \(\to 0\) | \(\not\to 0\) |
| \(\mathbb{P}\)(selects correct model) | \(\to 1\) | \(\not\to 1\) |
BIC is asymptotically consistent, while the other criteria tend to select models that are too large (overfitting).
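A small simulation sketch illustrating the table, under assumptions not taken from the course (Gaussian design, the true model uses only the first two of five candidate variables, and only the nested sub-models are compared): BIC recovers the true size almost always for large \(n\), while AIC keeps a non-vanishing probability of selecting a model that is too large.

```r
set.seed(1)

select_size <- function(n, pen) {
  X <- matrix(rnorm(n * 5), n, 5)               # 5 candidate variables
  y <- 1 + 2 * X[, 1] - 3 * X[, 2] + rnorm(n)   # true model: x1 and x2 only
  dat <- data.frame(y, X)                       # columns y, X1, ..., X5
  crits <- sapply(1:5, function(p) {            # nested sub-models X1, ..., Xp
    fit <- lm(y ~ ., data = dat[, c(1, 1 + seq_len(p))])
    SSR <- sum(residuals(fit)^2)
    n * log(SSR / n) + pen(n) * (p + 2)         # penalty counts p+1 coefficients plus sigma
  })
  which.min(crits)                              # selected number of variables
}

# proportion of runs selecting the true size (2 variables)
mean(replicate(200, select_size(500, function(n) 2))      == 2)  # AIC
mean(replicate(200, select_size(500, function(n) log(n))) == 2)  # BIC
```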
Given \(p_{\max}\) available explanatory variables, the exhaustive search compares all \(2^{p_{\max}}\) possible sub-models according to the chosen criterion.
If \(p_{\max}\) is not too large, this remains feasible.
R function: `regsubsets` from the `leaps` library.
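A sketch of the exhaustive search (hypothetical `dat` with response `y`; `nvmax` raises the default cap on the sub-model size):

```r
library(leaps)

search <- regsubsets(y ~ ., data = dat, nvmax = ncol(dat) - 1)
s <- summary(search)            # contains $bic, $cp, $adjr2, $which, ...

s$which[which.min(s$bic), ]     # best model according to BIC
s$which[which.min(s$cp), ]      # best model according to Mallows' C_p
s$which[which.max(s$adjr2), ]   # best model according to adjusted R^2
```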
Important Warning
Automatic selection does not guarantee that the selected model is good.
It’s simply the best model according to the chosen criterion.
The selected model may be bad in terms of:
If \(p_{\max}\) is too large for exhaustive search:
Stepwise Backward (according to a chosen criterion, e.g., BIC): start from the full model with all \(p_{\max}\) variables; at each step, remove the variable whose removal most improves the criterion; stop when no removal improves it.
Stepwise Forward: start from the intercept-only model; at each step, add the variable whose inclusion most improves the criterion; stop when no addition improves it.
Stepwise Backward (or Forward) Hybrid: at each step, consider both adding and removing a variable, and perform whichever move most improves the criterion.
Stepwise procedures do not explore all possible sub-models: they can miss the model that is optimal for the chosen criterion.
Speed comparison: a stepwise search fits on the order of \(p_{\max}^2\) models, versus \(2^{p_{\max}}\) for the exhaustive search.
R function: `step`, with option `direction` set to `"backward"`, `"forward"`, or `"both"`.
The option `k` corresponds to the penalty \(c(n)\) introduced earlier: `k = 2` gives AIC, `k = log(n)` gives BIC.
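A sketch of both directions (same hypothetical `dat` with response `y`):

```r
full  <- lm(y ~ ., data = dat)   # model with all p_max variables
empty <- lm(y ~ 1, data = dat)   # intercept-only model
n <- nrow(dat)

# backward elimination from the full model, with the BIC penalty
step(full, direction = "backward", k = log(n))

# forward selection from the empty model, with the AIC penalty (the default)
step(empty, scope = formula(full), direction = "forward", k = 2)
```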