Validation

\(\newcommand{\VS}{\quad \mathrm{VS} \quad}\) \(\newcommand{\and}{\quad \mathrm{and} \quad}\) \(\newcommand{\E}{\mathbb E}\) \(\newcommand{\P}{\mathbb P}\) \(\newcommand{\Var}{\mathbb V}\) \(\newcommand{\1}{\mathbf 1}\)

Explanatory Quality

Pythagorean Decomposition

Let \(\mathbf{1}\) be the column vector of ones in \(\mathbb R^{n\times 1}\).

If \(\mathbf{1} \in [X]\) (e.g., if the model contains an intercept), \[ \underbrace{\|Y-\overline Y \1\|^2}_{SST} = \underbrace{\|Y-\widehat Y\|^2}_{SSR}+\underbrace{\|\widehat Y-\overline Y \1\|^2}_{SSE}\]

In the general case,

\[\|Y\|^2 = \|Y-\widehat Y\|^2 + \|\widehat Y\|^2\]

Good model if the sum of squares of residuals \(SSR\) is small relative to \(SST\) (or to \(\|Y\|^2\) in the general case)

\(R^2\)

\(R^2\) if \(\1 \in [X]\)

\[R^2 = \frac{SSE}{SST} = 1-\frac{SSR}{SST}\]

\(R^2\) if \(\1 \not \in [X]\)

\[R^2 = \frac{\|\widehat Y\|^2}{\|Y\|^2} = 1 - \frac{SSR}{\|Y\|^2}\]

  • \(0 \leq R^2 \leq 1\). Better model if \(R^2\) close to \(1\)
  • Two definitions of \(R^2\) when \(\1 \in [X]\) or not
  • In simple linear regression \((Y_i = \beta_1+\beta_2X_i+\varepsilon_i)\): \(R^2 = \hat \rho^2\) is the squared empirical correlation between \(Y\) and \(X\)
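
A quick numerical check, as a minimal R sketch (using R's built-in trees data, also used in the output below):

model <- lm(Volume ~ Girth, data = trees)   # simple linear regression
summary(model)$r.squared                    # R^2 reported by R
cor(trees$Girth, trees$Volume)^2            # identical: squared empirical correlation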

Adjusted \(R^2\)

Main flaw of \(R^2\): adding a new variable always increases \(R^2\) (because \([X]\) becomes a bigger projection space, so \(SSR\) decreases)

\(R^2_a\) if \(\1 \in [X]\)

\[R^2_a = 1-\frac{n-1}{n-p}\frac{SSR}{SST}\]

\(R^2_a\) if \(\1 \not \in [X]\)

\[R^2_a = 1 - \frac{n}{n-p}\frac{SSR}{\|Y\|^2}\]

With a new variable, \(SSR\) decreases but \(p \to p+1\)

\(R^2_a\) increases when adding a new variable only if that variable reduces the residual sum of squares enough to compensate for the loss of one degree of freedom.
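
A minimal R sketch of this behaviour, adding a pure-noise predictor (artificial, for illustration) to a model on the built-in trees data:

set.seed(1)                                  # reproducibility
d <- trees
d$noise <- rnorm(nrow(d))                    # predictor unrelated to Y
m1 <- lm(Volume ~ Girth, data = d)
m2 <- lm(Volume ~ Girth + noise, data = d)
c(summary(m1)$r.squared, summary(m2)$r.squared)          # R^2 always goes up
c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared)  # R_a^2 typically goes down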

R output interpretation

Residuals:
   Min     1Q Median     3Q    Max 
-8.065 -3.107  0.152  3.495  9.587 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353,    Adjusted R-squared:  0.9331 

Here, \(R^2=0.9353\) and \(R^2_a=0.9331\).

\(\approx 93\%\) of the variability is explained by the model.

Testing Linear Constraints on \(\beta\)

Linear Constraints

We want to test \(q\) linear constraints on the coefficient vector \(\beta \in \mathbb R^p\).

This is formulated as: \[H_0: R\beta = 0 \quad \text{vs} \quad H_1: R\beta \neq 0\]

where \(R\) is a \(q \times p\) constraint matrix encoding the restrictions, with \(q \leq p\).

Common Test Types

\[H_0: R\beta = 0 \quad \text{vs} \quad H_1: R\beta \neq 0\]

  • Student’s t-test: is variable \(j\) significant?
    \(H_0: \beta_j = 0\) vs \(H_1: \beta_j \neq 0\)
  • Global F-test: is any variable significant? The constraint matrix is the identity with the intercept row removed
    \(H_0: \beta_2 = \cdots = \beta_p = 0\) vs \(H_1: \exists j \in \{2,\ldots,p\}\) s.t. \(\beta_j \neq 0\)
  • Nested model test: are q variables jointly significant?
    \(H_0: \beta_{p-q+1} = \cdots = \beta_p = 0\) vs \(H_1\): at least one of these \(\beta_j \neq 0\)

Key Applications

  • Individual significance: Testing if a single predictor matters
  • Overall model significance: Testing if the model explains anything beyond the intercept
  • Variable subset significance: Testing if a group of variables contributes to the model
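
In R, such constraints can be tested with linearHypothesis from the car package; a minimal sketch (assuming car is installed), on the built-in trees data:

library(car)
model <- lm(Volume ~ Girth + Height, data = trees)
linearHypothesis(model, "Height = 0")                   # one constraint (q = 1)
linearHypothesis(model, c("Girth = 0", "Height = 0"))   # joint test (q = 2)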

Fisher Test

Theorem

  • \(SSR\): sum of squares of residuals in the unconstrained regression model
  • \(SSR_c\): sum of squares of residuals in the constrained regression model, i.e., in the sub-model satisfying \(R\beta = 0\)

If \(\text{rank}(X) = p\), \(\text{rank}(R)=q\) and \(\varepsilon \sim N(0, \sigma^2 I_n)\), then under \(H_0: R\beta = 0\):

\[F = \frac{n-p}{q} \cdot \frac{{SSR}_c - {SSR}}{{SSR}} \sim F(q, n-p)\]

where \(F(q, n-p)\) denotes the Fisher distribution with \((q, n-p)\) degrees of freedom.

Rejection Region

Key argument: \(SSR_c - SSR\) is equal to \(\|P_{V}Y\|^2\), where \(V=\big(X\,\mathrm{Ker}(R)\big)^{\perp} \cap [X]\) and \(\dim(V)=q\).

Therefore, the critical region at significance level \(\alpha\) for testing \(H_0: R\beta = 0\) against \(H_1: R\beta \neq 0\) is:

\[RC_\alpha = \{F > f_{q,n-p}(1-\alpha)\}\]

where \(f_{q,n-p}(1-\alpha)\) denotes the \((1-\alpha)\)-quantile of an \(F(q, n-p)\) distribution.
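
In R, this quantile is given by qf; e.g. for \(q = 1\), \(n-p = 29\) and \(\alpha = 0.05\):

qf(0.95, df1 = 1, df2 = 29)   # critical value: reject H0 when F exceeds it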

Particular Case 1: Student Test

Fix some variable \(j\) and consider

\(H_0: \beta_j=0\) VS \(H_1:\beta_j\neq 0\)

Only one constraint: \(q=1\), so that

\[ F = (n-p) \frac{SSR_c-SSR}{SSR} \sim \mathcal F(1,n-p) \sim \mathcal T^2(n-p)\]

In fact, here \(F = \big(\tfrac{\hat \beta_j}{\hat \sigma_{\hat \beta_j}}\big)^2\), so \(F\) is the square of the Student test statistic presented before

Particular Case 2: Global Fisher Test

\(H_0: \beta_2= \dots = \beta_p=0\) VS \(H_1\): contrary

\(q = p-1\) in this case, so that

\[F = \frac{n-p}{p-1}\frac{SSE}{SSR} = \frac{n-p}{p-1}\frac{R^2}{1-R^2} \sim \mathcal F(p-1, n-p)\]

Particular Case 3: Nested Fisher Test

\(H_0: \beta_{p-q+1}= \dots = \beta_p=0\) VS \(H_1\): contrary

\[F = \frac{n-p}{q}\frac{SSR_c - SSR}{SSR} \sim \mathcal F(q, n-p)\]

Interpretation:

If \(F \geq f_{q,n-p}(1-\alpha)\) (the \((1-\alpha)\)-quantile of the Fisher distribution), we reject \(H_0\): the constraints are not supported by the data, and the submodel is rejected in favor of the larger model.
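
In R, the nested Fisher test is performed by anova applied to the two fitted models; a minimal sketch on the built-in trees data:

full <- lm(Volume ~ Girth + Height, data = trees)
sub  <- lm(Volume ~ Girth, data = trees)      # submodel: beta_Height = 0
anova(sub, full)                              # F statistic and p-value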

Example of R Output


Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353,    Adjusted R-squared:  0.9331 
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

Here, the global Fisher significance test statistic is \(F=419.4\), with \(q=1\) and \(n-p=29\) (\(n=31\)). The p-value is negligible. As a sanity check, \(\frac{n-p}{p-1}\cdot\frac{R^2}{1-R^2} = 29 \times \frac{0.9353}{1-0.9353} \approx 419\).

Here \(q=1\), so the global Fisher test is a Student test: indeed, \(F = 419.4 \approx (20.48)^2 = t^2\).

Verification of Model Hypotheses

Model Assumptions

The linear regression model relies on the following key assumptions:

  • Model specification: \(Y = X\beta + \varepsilon\) (linear relationship)
  • Full rank design: \(\text{rank}(X) = p\) (no perfect multicollinearity)
  • Zero mean errors: \(\mathbb{E}(\varepsilon) = 0\)
  • Homoscedastic errors: \(\text{Var}(\varepsilon) = \sigma^2 I_n\) (constant variance and uncorrelated errors)

We now present diagnostic tools to verify each assumption, and remedial strategies when they fail.

Linearity Assumption

The linearity assumption \(\mathbb{E}(Y) = X\beta\) is the fundamental hypothesis of linear regression.

Diagnostic Tools:

  • Pre-modeling: Scatter plots \((X^{(j)}, Y)\) and empirical correlations for each predictor
  • Post-modeling: Residual analysis - non-linearity manifests as patterns in \(\hat{\varepsilon}\)
  • Advanced: Partial residual plots (not covered here)


Remedial Strategies:

  • Transformations: Apply transformations to \(Y\) and/or predictors \(X^{(j)}\) to achieve linearity
  • Alternative models: If transformations fail, consider nonlinear regression models

Full Rank Assumption

The condition \(\text{rank}(X) = p\) ensures no predictor \(X^{(j)}\) is a linear combination of others.

Why it matters:

  • Identifiability: Without full rank, \(\beta\) is not uniquely defined
  • Estimation: \((X^TX)\) becomes non-invertible, making \(\hat{\beta} = (X^TX)^{-1}X^TY\) undefined
  • Infinitely many solutions satisfy \(X^TX\hat{\beta} = X^TY\)


In practice:

  • Perfect collinearity is rare, so \(\text{rank}(X) = p\) usually holds
  • Near-collinearity is the real concern - when predictors are “almost” linearly dependent

Near-Collinearity Issues

When a variable is highly correlated with others (correlation close to but not exactly \(\pm 1\)):

Mathematical consequences:

  • \(X^TX\) remains invertible, but its smallest eigenvalue approaches zero
  • \((X'X)^{-1}\) becomes numerically unstable

Statistical implications:

  • Instability: Adding/removing a single observation can drastically change \((X^TX)^{-1}\)
  • Unreliable estimates: \(\hat{\beta} = (X^TX)^{-1}X^TY\) becomes highly unstable
  • Inflated variance: \(\text{Var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}\) becomes very large

This is undesirable from a statistical point of view.

Detecting Collinearity: VIF

Compute the VIF (Variance Inflation Factor) for each \(X^{(j)}\):

  1. Regress \(X^{(j)}\) on all other \(X^{(k)}\) (where \(k \neq j\))
  2. Compute \(R_j^2\) from this regression
  3. Calculate: \(\text{VIF}_j = \frac{1}{1 - R_j^2}\)

Properties:

  • \(\text{VIF}_j \geq 1\) always
  • High VIF indicates collinearity; common threshold: \(\text{VIF}_j \geq 5\)
  • In R: vif() from the car package (see the sketch below)
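
A minimal sketch in R, computing one VIF by hand and all of them via car::vif (assuming car is installed):

model <- lm(Volume ~ Girth + Height, data = trees)
r2_girth <- summary(lm(Girth ~ Height, data = trees))$r.squared  # steps 1-2
1 / (1 - r2_girth)     # VIF of Girth, by hand
car::vif(model)        # VIF of every predictor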

Remedies for Multicollinearity

  • Variable removal: Drop variables with high VIF
  • Preferably remove those least correlated with \(Y\)
  • Penalized regression: Ridge, LASSO, or elastic net methods (not covered here)

Important Distinction on Multicollinearity

  • Parameter estimation: Multicollinearity severely affects \(\hat{\beta}\) reliability
  • Prediction: Not problematic: \(\hat{Y}\) remains well-defined and stable since projection on \([X]\) is still unique

Analysis of the Residuals

Recall that \(\hat \varepsilon = Y - \widehat Y = P_{[X]^{\perp}} \varepsilon\)

Residual Properties

  • \(\E(\hat{\varepsilon}) = 0\)
  • \(\Var(\hat{\varepsilon}) = \sigma^2P_{[X]^{\perp}}\)
  • \(\text{Cov}(\hat{\varepsilon}, \hat{Y}) = 0\)
  • If \(\1 \in [X]\): \(\bar{\hat{\varepsilon}} = 0\)

Diagnostic Tools

  1. Graphical Assessment: Visual evaluation of model quality

  2. Homoscedasticity Test: \(\Var(\varepsilon_i) = \sigma^2\) for all \(i\) (constant variance)

  3. Non-correlation Test: \(\Var(\varepsilon)\) is diagonal (uncorrelated errors)

  4. Normality Test: Examine normality of residuals

Residual vs. Fitted Plot

The scatter plot between \(\hat{Y}\) and \(\hat{\varepsilon}\) is informative.

Since \(\text{Cov}(\hat{\varepsilon}, \hat{Y}) = 0\), no structure should appear.

If patterns emerge, this may indicate violations of:

  • Linearity assumption
  • Homoscedasticity assumption
  • Non-correlation assumption
  • Or a combination of these…
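
A minimal R sketch of this diagnostic:

model <- lm(Volume ~ Girth, data = trees)
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals")   # should show no structure
abline(h = 0, lty = 2)
plot(model, which = 1)    # equivalent built-in diagnostic plot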

Homoscedasticity Test (Breusch-Pagan)

We want to test whether \(\Var(\varepsilon_i)=\sigma^2, ~~\forall i\)

Principle: Assume \(\varepsilon_i\) has variance \(\sigma_i^2 = \sigma^2 + z_i^T\gamma\) where:

  • \(z_i\) is a \(k\)-vector of variables that might explain heteroscedasticity (known)
  • Default in R: \(z_i = (X_i^{(1)}, \ldots, X_i^{(p)})\), so \(k = p\)
  • \(\gamma\) is an unknown \(k\)-dimensional parameter

\(H_0: \gamma = 0\) (homosced.) VS \(H_1: \gamma \neq 0\) (heterosced.)

R function: bptest from lmtest library
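
A minimal sketch (assuming lmtest is installed):

library(lmtest)
model <- lm(Volume ~ Girth, data = trees)
bptest(model)   # H0: homoscedasticity; small p-value suggests heteroscedasticity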

Consequences of Heteroscedasticity

What happens:

  • OLS estimation of \(\beta\) is no longer optimal but remains consistent
  • Inference tools (tests, confidence intervals) become invalid because they rely on \(\hat{\sigma}^2\) estimation

Solutions for Heteroscedasticity

  • Transformation: Transform \(Y\) (e.g., log transformation) to stabilize variance
  • Modeling: Model heteroscedasticity explicitly and account for it in estimation
  • GLS: Use Generalized Least Squares (not detailed here)

3. Non-correlation Test

Purpose: Test if \(\Var(\varepsilon)\) is diagonal (uncorrelated errors)

Correlation between \(\varepsilon_i\) often occurs with temporal data (index \(i\) represents time)

Auto-correlation Model of order \(r\):

\[\varepsilon_i = \rho_1 \varepsilon_{i-1} + \dots + \rho_r \varepsilon_{i-r} + \eta_i\] where \(\eta_i \sim \text{iid } N(0, \sigma^2)\)

Tests for Correlation

In the auto-correlation model \(\varepsilon_i = \rho_1 \varepsilon_{i-1} + \dots + \rho_r \varepsilon_{i-r} + \eta_i\):

Durbin-Watson Test (for \(r = 1\) only):

  • \(H_0: \rho_1 = 0\) VS \(H_1: \rho_1 \neq 0\)
  • R function: dwtest from lmtest

Breusch-Godfrey Test (for any \(r\)):

  • \(H_0: \rho_1 = \cdots = \rho_r = 0\) VS \(H_1:\) at least one \(\rho_j \neq 0\)
  • User chooses order \(r\) (default \(r = 1\))
  • R function: bgtest from lmtest
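
A minimal sketch of both tests (assuming lmtest is installed; trees is not temporal data, so this only illustrates the syntax):

library(lmtest)
model <- lm(Volume ~ Girth, data = trees)
dwtest(model)              # Durbin-Watson, r = 1
bgtest(model, order = 2)   # Breusch-Godfrey, here r = 2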

Consequences of Auto-correlation

  • OLS estimation of \(\beta\) is no longer optimal but remains consistent
  • Inference tools (tests, confidence intervals) become invalid

Solutions:

  • GLS modeling: Model the dependence structure (complex, risky if wrong)
  • Model improvement: Exploit the dependence to enhance the model
  • Ex: Explain \(Y_i\) using \(Y_{i-1}\) in addition to \((X_i^{(1)}, \dots, X_i^{(p)})\)

4. Normality Test

Purpose: Examine normality of residuals \(\hat \varepsilon\)

Reminder on Normality Assumption

  • Not essential when \(n\) is large
  • All tests remain asymptotically valid
  • Only prediction intervals truly require normality

Why examine it anyway?

If \(\varepsilon \sim N(0, \sigma^2 I_n)\) then \(\hat{\varepsilon} \sim N(0, \sigma^2 P_{[X]^{\perp}})\)

Diagnostic Tools for Normality of \(\hat \varepsilon\)

Q-Q Plot (Henry’s line):

  • Plot theoretical vs. sample quantiles of \(\hat{\varepsilon}\)
  • R function: qqnorm

Shapiro-Wilk, \(\chi^2\), or Kolmogorov-Smirnov (KS) Tests:

  • Formal test of normality for \(\hat{\varepsilon}\)
  • \(H_0\): residuals are normally distributed
  • \(H_1\): residuals are not normally distributed
  • R function: shapiro.test
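
A minimal R sketch of both diagnostics:

model <- lm(Volume ~ Girth, data = trees)
qqnorm(resid(model)); qqline(resid(model))   # Q-Q plot with Henry's line
shapiro.test(resid(model))                   # H0: normality of residuals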

Outlier Analysis

An individual is atypical when:

  1. Poorly explained by the model, and/or
  2. Heavily influences coefficient estimation

Identify these individuals to:

  • Understand the reason for this particularity
  • Potentially modify the model accordingly
  • Potentially exclude the individual from the study

Poorly Explained Individuals

Individual \(i\) is poorly explained if its residual \(\hat{\varepsilon}_i\) is “abnormally” large.

How to quantify “abnormally”?

Let \(h_{ij}\) denote the entries of the matrix \(P_{[X]}\) (the hat matrix).

For a Gaussian model: \(\hat{\varepsilon}_i \sim N(0, (1-h_{ii})\sigma^2)\)

Standardized Residuals

\[t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\]


We expect \(t_i \sim St(n-p)\) (not strictly true since \(\hat{\varepsilon}_i \not\perp \hat{\sigma}^2\))

Definition

Individual \(i\) is considered poorly explained by the model if: \[|t_i| > t_{n-p}(1-\alpha/2)\] for predetermined \(\alpha\), typically \(\alpha = 0.05\), giving \(t_{n-p}(1-\alpha/2) \approx 2\).
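
In R, the \(t_i\) are returned by rstandard; a minimal sketch flagging poorly explained individuals:

model <- lm(Volume ~ Girth, data = trees)
t_i <- rstandard(model)    # epsilon_hat_i / (sigma_hat * sqrt(1 - h_ii))
which(abs(t_i) > 2)        # individuals with |t_i| > ~2 (alpha = 0.05)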

Leverage Points

A point is influential if it contributes significantly to \(\hat{\beta}\) estimation.

Leverage value: \(h_{ii}\) corresponds to the weight of \(Y_i\) on its own fitted value \(\hat{Y}_i\)

We know that: \(\sum_{i=1}^n h_{ii} = \text{tr}(P_{[X]}) = p\)

Therefore, on average: \(h_{ii} \approx p/n\)

Definition

Individual \(i\) is called a leverage point if \(h_{ii} \gg p/n\)

Typically: \(h_{ii} > 2p/n\) or \(h_{ii} > 3p/n\)
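
In R, the \(h_{ii}\) are returned by hatvalues; a minimal sketch flagging leverage points:

model <- lm(Volume ~ Girth, data = trees)
h <- hatvalues(model)                # diagonal of the hat matrix P_[X]
p <- length(coef(model)); n <- nobs(model)
which(h > 2 * p / n)                 # common leverage threshold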

Outlier Analysis: Cook’s Distance

Cook’s Distance

Quantifies the influence of individual \(i\) on \(\hat{Y}\):

\[C_i = \frac{\|\hat{Y} - \hat{Y}_{(-i)}\|^2}{p\hat{\sigma}^2}\]

where \(\hat{Y}_{(-i)} = X\hat{\beta}_{(-i)}\) with \(\hat{\beta}_{(-i)}\): estimation of \(\beta\) without individual \(i\)

Cook’s Distance, Alternative Formula

\[C_i = \frac{1}{p} \cdot \frac{h_{ii}}{1-h_{ii}} \cdot t_i^2,\]

where \(t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\).

This formula shows that Cook’s distance \(C_i\) combines:

  1. The outlier effect: how poorly the individual is explained (through \(t_i\))
  2. The leverage effect (through \(h_{ii}\))

R functions: cooks.distance and the last plot produced by plot.lm.
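
A minimal sketch of both on the built-in trees data:

model <- lm(Volume ~ Girth, data = trees)
cooks.distance(model)     # one value C_i per individual
plot(model, which = 4)    # Cook's distance plot
plot(model, which = 5)    # residuals vs leverage (last plot of plot.lm)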