Validation

\(\newcommand{\VS}{\quad \mathrm{VS} \quad}\) \(\newcommand{\and}{\quad \mathrm{and} \quad}\) \(\newcommand{\E}{\mathbb E}\) \(\newcommand{\P}{\mathbb P}\) \(\newcommand{\Var}{\mathbb V}\) \(\newcommand{\1}{\mathbf 1}\)

Explanatory Quality

Pythagorean Decomposition

Let \(\mathbf{1}\) be the column vector of ones in \(\mathbb R^{n\times 1}\).

If \(\mathbf{1} \in [X]\) (e.g., if the model contains an intercept), \[ \underbrace{\|Y-\overline Y \1\|^2}_{SST} = \underbrace{\|Y-\widehat Y\|^2}_{SSR}+\underbrace{\|\widehat Y-\overline Y \1\|^2}_{SSE}\]

In the general case,

\[\|Y\|^2 = \|Y-\widehat Y\|^2 + \|\widehat Y\|^2\]

Good model if the sum of squares of residuals \(SSR\) is small relative to \(SST\) (or to \(\|Y\|^2\) in the general case)

\(R^2\)

\(R^2\) if \(\1 \in [X]\)

\[R^2 = \frac{SSE}{SST} = 1-\frac{SSR}{SST}\]

\(R^2\) if \(\1 \not \in [X]\)

\[R^2 = \frac{\|\widehat Y\|^2}{\|Y\|^2} = 1 - \frac{SSR}{\|Y\|^2}\]

  • \(0 \leq R^2 \leq 1\). Better model if \(R^2\) close to \(1\)
  • Two definitions of \(R^2\) when \(\1 \in [X]\) or not
  • In simple linear regression \((Y_i = \beta_1+\beta_2X_i+\varepsilon_i)\): \(R^2 = \hat \rho^2\) is the squared empirical correlation between \(Y\) and \(X\)
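
A quick numerical check, as a minimal R sketch (using R's built-in trees data, also used in the output below):

model <- lm(Volume ~ Girth, data = trees)   # simple linear regression
summary(model)$r.squared                    # R^2 reported by R
cor(trees$Girth, trees$Volume)^2            # identical: squared empirical correlation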

Adjusted \(R^2\)

Main flaw of \(R^2\): adding a new variable always increases \(R^2\) (because \([X]\) becomes a bigger projection space, so \(SSR\) decreases)

\(R^2_a\) if \(\1 \in [X]\)

\[R^2_a = 1-\frac{n-1}{n-p}\frac{SSR}{SST}\]

\(R^2_a\) if \(\1 \not \in [X]\)

\[R^2_a = 1 - \frac{n}{n-p}\frac{SSR}{\|Y\|^2}\]

With a new variable, \(SSR\) decreases but \(p \to p+1\)

\(R^2_a\) increases when adding a new variable only if that variable reduces the residual sum of squares enough to compensate for the loss of one degree of freedom.
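
A minimal R sketch of this behaviour, adding a pure-noise predictor (artificial, for illustration) to a model on the built-in trees data:

set.seed(1)                                  # reproducibility
d <- trees
d$noise <- rnorm(nrow(d))                    # predictor unrelated to Y
m1 <- lm(Volume ~ Girth, data = d)
m2 <- lm(Volume ~ Girth + noise, data = d)
c(summary(m1)$r.squared, summary(m2)$r.squared)          # R^2 always goes up
c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared)  # R_a^2 typically goes down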

R output interpretation

Residuals:
   Min     1Q Median     3Q    Max 
-8.065 -3.107  0.152  3.495  9.587 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353,    Adjusted R-squared:  0.9331 

Here, \(R^2=0.9353\) and \(R^2_a=0.9331\).

\(\approx 93\%\) of the variability is explained by the model.

Testing Linear Constraints on \(\beta\)

Linear Constraints

We want to test \(q\) linear constraints on the coefficient vector \(\beta \in \mathbb R^p\).

This is formulated as: \[H_0: R\beta = 0 \quad \text{vs} \quad H_1: R\beta \neq 0\]

where \(R\) is a \(q \times p\) constraint matrix encoding the restrictions, with \(q \leq p\).

Common Test Types

\[H_0: R\beta = 0 \quad \text{vs} \quad H_1: R\beta \neq 0\]

  • Student’s t-test: is variable \(j\) significant?
    \(H_0: \beta_j = 0\) vs \(H_1: \beta_j \neq 0\)
  • Global F-test: is any variable significant? The constraint matrix is the identity with the intercept row removed
    \(H_0: \beta_2 = \cdots = \beta_p = 0\) vs \(H_1: \exists j \in \{2,\ldots,p\}\) s.t. \(\beta_j \neq 0\)
  • Nested model test: are q variables jointly significant?
    \(H_0: \beta_{p-q+1} = \cdots = \beta_p = 0\) vs \(H_1\): at least one of these \(\beta_j \neq 0\)

Key Applications

  • Individual significance: Testing if a single predictor matters
  • Overall model significance: Testing if the model explains anything beyond the intercept
  • Variable subset significance: Testing if a group of variables contributes to the model
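
In R, such constraints can be tested with linearHypothesis from the car package; a minimal sketch (assuming car is installed), on the built-in trees data:

library(car)
model <- lm(Volume ~ Girth + Height, data = trees)
linearHypothesis(model, "Height = 0")                   # one constraint (q = 1)
linearHypothesis(model, c("Girth = 0", "Height = 0"))   # joint test (q = 2)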

Fisher Test

Theorem

  • \(SSR\): sum of squares of residuals in the unconstrained regression model
  • \(SSR_c\): sum of squares of residuals in the constrained regression model, i.e., in the sub-model satisfying \(R\beta = 0\)

If \(\text{rank}(X) = p\), \(\text{rank}(R)=q\) and \(\varepsilon \sim N(0, \sigma^2 I_n)\), then under \(H_0: R\beta = 0\):

\[F = \frac{n-p}{q} \cdot \frac{{SSR}_c - {SSR}}{{SSR}} \sim F(q, n-p)\]

where \(F(q, n-p)\) denotes the Fisher distribution with \((q, n-p)\) degrees of freedom.

Rejection Region

Key argument: \(SSR_c - SSR\) is equal to \(\|P_{V}Y\|^2\), where \(V=\big(X\,\mathrm{Ker}(R)\big)^{\perp} \cap [X]\) and \(\dim(V)=q\).

Therefore, the critical region at significance level \(\alpha\) for testing \(H_0: R\beta = 0\) against \(H_1: R\beta \neq 0\) is:

\[RC_\alpha = \{F > f_{q,n-p}(1-\alpha)\}\]

where \(f_{q,n-p}(1-\alpha)\) denotes the \((1-\alpha)\)-quantile of an \(F(q, n-p)\) distribution.
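
In R, this quantile is given by qf; e.g. for \(q = 1\), \(n-p = 29\) and \(\alpha = 0.05\):

qf(0.95, df1 = 1, df2 = 29)   # critical value: reject H0 when F exceeds it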

Particular Case 1: Student Test

Fix some variable \(j\) and consider

\(H_0: \beta_j=0\) VS \(H_1:\beta_j\neq 0\)

Only one constraint: \(q=1\), so that

\[ F = (n-p) \frac{SSR_c-SSR}{SSR} \sim \mathcal F(1,n-p) \sim \mathcal T^2(n-p)\]

In fact, here \(F = \big(\tfrac{\hat \beta_j}{\hat \sigma_{\hat \beta_j}}\big)^2\), so \(F\) is the square of the Student test statistic presented before

Particular Case 2: Global Fisher Test

\(H_0: \beta_2= \dots = \beta_p=0\) VS \(H_1\): contrary

\(q = p-1\) in this case, so that

\[F = \frac{n-p}{p-1}\frac{SSE}{SSR} = \frac{n-p}{p-1}\frac{R^2}{1-R^2} \sim \mathcal F(p-1, n-p)\]

Particular Case 3: Nested Fisher Test

\(H_0: \beta_{p-q+1}= \dots = \beta_p=0\) VS \(H_1\): contrary

\[F = \frac{n-p}{q}\frac{SSR_c - SSR}{SSR} \sim \mathcal F(q, n-p)\]

Interpretation:

If \(F \geq f_{q,n-p}(1-\alpha)\) (the \((1-\alpha)\)-quantile of the Fisher distribution), we reject \(H_0\): the constraints are not supported by the data, and the submodel is rejected in favor of the larger model.
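
In R, the nested Fisher test is performed by anova applied to the two fitted models; a minimal sketch on the built-in trees data:

full <- lm(Volume ~ Girth + Height, data = trees)
sub  <- lm(Volume ~ Girth, data = trees)      # submodel: beta_Height = 0
anova(sub, full)                              # F statistic and p-value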

Example of R Output


Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353,    Adjusted R-squared:  0.9331 
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

Here, the global Fisher significance test statistic is \(F=419.4\), with \(q=1\) and \(n-p=29\) (\(n=31\)). The p-value is negligible. As a sanity check, \(\frac{n-p}{p-1}\cdot\frac{R^2}{1-R^2} = 29 \times \frac{0.9353}{1-0.9353} \approx 419\).

Here \(q=1\), so the global Fisher test is a Student test: indeed, \(F = 419.4 \approx (20.48)^2 = t^2\).

Verification of Model Hypotheses

Model Assumptions

The linear regression model relies on the following key assumptions:

  • Model specification: \(Y = X\beta + \varepsilon\) (linear relationship)
  • Full rank design: \(\text{rank}(X) = p\) (no perfect multicollinearity)
  • Zero mean errors: \(\mathbb{E}(\varepsilon) = 0\)
  • Homoscedastic errors: \(\text{Var}(\varepsilon) = \sigma^2 I_n\) (constant variance and uncorrelated errors)

We now present diagnostic tools to verify each assumption, and remedial strategies when they fail.

Linearity Assumption

The linearity assumption \(\mathbb{E}(Y) = X\beta\) is the fundamental hypothesis of linear regression.

Diagnostic Tools:

  • Pre-modeling: Scatter plots \((X^{(j)}, Y)\) and empirical correlations for each predictor
  • Post-modeling: Residual analysis - non-linearity manifests as patterns in \(\hat{\varepsilon}\)
  • Advanced: Partial residual plots (not covered here)


Remedial Strategies:

  • Transformations: Apply transformations to \(Y\) and/or predictors \(X^{(j)}\) to achieve linearity
  • Alternative models: If transformations fail, consider nonlinear regression models

Full Rank Assumption

The condition \(\text{rank}(X) = p\) ensures no predictor \(X^{(j)}\) is a linear combination of others.

Why it matters:

  • Identifiability: Without full rank, \(\beta\) is not uniquely defined
  • Estimation: \((X^TX)\) becomes non-invertible, making \(\hat{\beta} = (X^TX)^{-1}X^TY\) undefined
  • Infinitely many solutions satisfy \(X^TX\hat{\beta} = X^TY\)


In practice:

  • Perfect collinearity is rare, so \(\text{rank}(X) = p\) usually holds
  • Near-collinearity is the real concern - when predictors are “almost” linearly dependent

Near-Collinearity Issues

When a variable is highly correlated with others (correlation close to but not exactly \(\pm 1\)):

Mathematical consequences:

  • \(X^TX\) remains invertible, but its smallest eigenvalue approaches zero
  • \((X'X)^{-1}\) becomes numerically unstable

Statistical implications:

  • Instability: Adding/removing a single observation can drastically change \((X^TX)^{-1}\)
  • Unreliable estimates: \(\hat{\beta} = (X^TX)^{-1}X^TY\) becomes highly unstable
  • Inflated variance: \(\text{Var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}\) becomes very large

This is undesirable from a statistical point of view.

Detecting Collinearity: VIF

Compute the VIF (Variance Inflation Factor) for each \(X^{(j)}\):

  1. Regress \(X^{(j)}\) on all other \(X^{(k)}\) (where \(k \neq j\))
  2. Compute \(R_j^2\) from this regression
  3. Calculate: \(\text{VIF}_j = \frac{1}{1 - R_j^2}\)

Properties:

  • \(\text{VIF}_j \geq 1\) always
  • High VIF indicates collinearity; common threshold: \(\text{VIF}_j \geq 5\)
  • In R: vif() from the car package (see the sketch below)
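
A minimal sketch in R, computing one VIF by hand and all of them via car::vif (assuming car is installed):

model <- lm(Volume ~ Girth + Height, data = trees)
r2_girth <- summary(lm(Girth ~ Height, data = trees))$r.squared  # steps 1-2
1 / (1 - r2_girth)     # VIF of Girth, by hand
car::vif(model)        # VIF of every predictor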

Remedies for Multicollinearity

  • Variable removal: Drop variables with high VIF
  • Preferably remove those least correlated with \(Y\)
  • Penalized regression: Ridge, LASSO, or elastic net methods (not covered here)

Important Distinction on Multicollinearity

  • Parameter estimation: Multicollinearity severely affects \(\hat{\beta}\) reliability
  • Prediction: Not problematic: \(\hat{Y}\) remains well-defined and stable since projection on \([X]\) is still unique

Analysis of the Residuals

Recall that \(\hat \varepsilon = Y - \widehat Y = P_{[X]^{\perp}} \varepsilon\)

Residual Properties

  • \(\E(\hat{\varepsilon}) = 0\)
  • \(\Var(\hat{\varepsilon}) = \sigma^2P_{[X]^{\perp}}\)
  • \(\text{Cov}(\hat{\varepsilon}, \hat{Y}) = 0\)
  • If \(\1 \in [X]\): \(\bar{\hat{\varepsilon}} = 0\)

Diagnostic Tools

  1. Graphical Assessment: Visual evaluation of model quality

  2. Homoscedasticity Test: \(\Var(\varepsilon_i) = \sigma^2\) for all \(i\) (constant variance)

  3. Non-correlation Test: \(\Var(\varepsilon)\) is diagonal (uncorrelated errors)

  4. Normality Test: Examine normality of residuals

Residual vs. Fitted Plot

The scatter plot between \(\hat{Y}\) and \(\hat{\varepsilon}\) is informative.

Since \(\text{Cov}(\hat{\varepsilon}, \hat{Y}) = 0\), no structure should appear.

If patterns emerge, this may indicate violations of:

  • Linearity assumption
  • Homoscedasticity assumption
  • Non-correlation assumption
  • Or a combination of these…
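
A minimal R sketch of this diagnostic:

model <- lm(Volume ~ Girth, data = trees)
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals")   # should show no structure
abline(h = 0, lty = 2)
plot(model, which = 1)    # equivalent built-in diagnostic plot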

Homoscedasticity Test (Breusch-Pagan)

We want to test whether \(\Var(\varepsilon_i)=\sigma^2, ~~\forall i\)

Principle: Assume \(\varepsilon_i\) has variance \(\sigma_i^2 = \sigma^2 + z_i^T\gamma\) where:

  • \(z_i\) is a \(k\)-vector of variables that might explain heteroscedasticity (known)
  • Default in R: \(z_i = (X_i^{(1)}, \ldots, X_i^{(p)})\), so \(k = p\)
  • \(\gamma\) is an unknown \(k\)-dimensional parameter

\(H_0: \gamma = 0\) (homosced.) VS \(H_1: \gamma \neq 0\) (heterosced.)

R function: bptest from lmtest library
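
A minimal sketch (assuming lmtest is installed):

library(lmtest)
model <- lm(Volume ~ Girth, data = trees)
bptest(model)   # H0: homoscedasticity; small p-value suggests heteroscedasticity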

Consequences of Heteroscedasticity

What happens:

  • OLS estimation of \(\beta\) is no longer optimal but remains consistent
  • Inference tools (tests, confidence intervals) become invalid because they rely on \(\hat{\sigma}^2\) estimation

Solutions for Heteroscedasticity

  • Transformation: Transform \(Y\) (e.g., log transformation) to stabilize variance
  • Modeling: Model heteroscedasticity explicitly and account for it in estimation
  • GLS: Use Generalized Least Squares (not detailed here)

3. Non-correlation Test

Purpose: Test if \(\Var(\varepsilon)\) is diagonal (uncorrelated errors)

Correlation between \(\varepsilon_i\) often occurs with temporal data (index \(i\) represents time)

Auto-correlation Model of order \(r\):

\[\varepsilon_i = \rho_1 \varepsilon_{i-1} + \dots + \rho_r \varepsilon_{i-r} + \eta_i\] where \(\eta_i \sim \text{iid } N(0, \sigma^2)\)

Tests for Correlation

In the auto-correlation model \(\varepsilon_i = \rho_1 \varepsilon_{i-1} + \dots + \rho_r \varepsilon_{i-r} + \eta_i\):

Durbin-Watson Test (for \(r = 1\) only):

  • \(H_0: \rho_1 = 0\) VS \(H_1: \rho_1 \neq 0\)
  • R function: dwtest from lmtest

Breusch-Godfrey Test (for any \(r\)):

  • \(H_0: \rho_1 = \cdots = \rho_r = 0\) VS \(H_1:\) at least one \(\rho_j \neq 0\)
  • User chooses order \(r\) (default \(r = 1\))
  • R function: bgtest from lmtest
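
A minimal sketch of both tests (assuming lmtest is installed; trees is not temporal data, so this only illustrates the syntax):

library(lmtest)
model <- lm(Volume ~ Girth, data = trees)
dwtest(model)              # Durbin-Watson, r = 1
bgtest(model, order = 2)   # Breusch-Godfrey, here r = 2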

Consequences of Auto-correlation

  • OLS estimation of \(\beta\) is no longer optimal but remains consistent
  • Inference tools (tests, confidence intervals) become invalid

Solutions:

  • GLS modeling: Model the dependence structure (complex, risky if wrong)
  • Model improvement: Exploit the dependence to enhance the model
  • Ex: Explain \(Y_i\) using \(Y_{i-1}\) in addition to \((X_i^{(1)}, \dots, X_i^{(p)})\)

4. Normality Test

Purpose: Examine normality of residuals \(\hat \varepsilon\)

Reminder on Normality Assumption

  • Not essential when \(n\) is large
  • All tests remain asymptotically valid
  • Only prediction intervals truly require normality

Why examine it anyway?

If \(\varepsilon \sim N(0, \sigma^2 I_n)\) then \(\hat{\varepsilon} \sim N(0, \sigma^2 P_{[X]^{\perp}})\)

Diagnostic Tools for Normality of \(\hat \varepsilon\)

Q-Q Plot (Henry’s line):

  • Plot theoretical vs. sample quantiles of \(\hat{\varepsilon}\)
  • R function: qqnorm

Shapiro-Wilk, \(\chi^2\), or Kolmogorov-Smirnov (KS) Tests:

  • Formal test of normality for \(\hat{\varepsilon}\)
  • \(H_0\): residuals are normally distributed
  • \(H_1\): residuals are not normally distributed
  • R function: shapiro.test
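
A minimal R sketch of both diagnostics:

model <- lm(Volume ~ Girth, data = trees)
qqnorm(resid(model)); qqline(resid(model))   # Q-Q plot with Henry's line
shapiro.test(resid(model))                   # H0: normality of residuals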

Outlier Analysis

An individual is atypical when:

  1. Poorly explained by the model, and/or
  2. Heavily influences coefficient estimation

Identify these individuals to:

  • Understand the reason for this particularity
  • Potentially modify the model accordingly
  • Potentially exclude the individual from the study

Poorly Explained Individuals

Individual \(i\) is poorly explained if its residual \(\hat{\varepsilon}_i\) is “abnormally” large.

How to quantify “abnormally”?

Let \(h_{ij}\) denote the entries of the matrix \(P_{[X]}\) (the hat matrix).

For a Gaussian model: \(\hat{\varepsilon}_i \sim N(0, (1-h_{ii})\sigma^2)\)

Standardized Residuals

\[t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\]


We expect \(t_i \sim St(n-p)\) (not strictly true since \(\hat{\varepsilon}_i \not\perp \hat{\sigma}^2\))

Definition

Individual \(i\) is considered poorly explained by the model if: \[|t_i| > t_{n-p}(1-\alpha/2)\] for predetermined \(\alpha\), typically \(\alpha = 0.05\), giving \(t_{n-p}(1-\alpha/2) \approx 2\).
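
In R, the \(t_i\) are returned by rstandard; a minimal sketch flagging poorly explained individuals:

model <- lm(Volume ~ Girth, data = trees)
t_i <- rstandard(model)    # epsilon_hat_i / (sigma_hat * sqrt(1 - h_ii))
which(abs(t_i) > 2)        # individuals with |t_i| > ~2 (alpha = 0.05)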

Leverage Points

A point is influential if it contributes significantly to \(\hat{\beta}\) estimation.

Leverage value: \(h_{ii}\) corresponds to the weight of \(Y_i\) on its own fitted value \(\hat{Y}_i\)

We know that: \(\sum_{i=1}^n h_{ii} = \text{tr}(P_{[X]}) = p\)

Therefore, on average: \(h_{ii} \approx p/n\)

Definition

Individual \(i\) is called a leverage point if \(h_{ii} \gg p/n\)

Typically: \(h_{ii} > 2p/n\) or \(h_{ii} > 3p/n\)
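
In R, the \(h_{ii}\) are returned by hatvalues; a minimal sketch flagging leverage points:

model <- lm(Volume ~ Girth, data = trees)
h <- hatvalues(model)                # diagonal of the hat matrix P_[X]
p <- length(coef(model)); n <- nobs(model)
which(h > 2 * p / n)                 # common leverage threshold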

Outlier Analysis: Cook’s Distance

Cook’s Distance

Quantifies the influence of individual \(i\) on \(\hat{Y}\):

\[C_i = \frac{\|\hat{Y} - \hat{Y}_{(-i)}\|^2}{p\hat{\sigma}^2}\]

where \(\hat{Y}_{(-i)} = X\hat{\beta}_{(-i)}\) with \(\hat{\beta}_{(-i)}\): estimation of \(\beta\) without individual \(i\)

Cook’s Distance, Alternative Formula

\[C_i = \frac{1}{p} \cdot \frac{h_{ii}}{1-h_{ii}} \cdot t_i^2,\]

where \(t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\).

This formula shows that Cook’s distance \(C_i\) combines:

  1. The outlier effect: how poorly the individual is explained (through \(t_i\))
  2. The leverage effect (through \(h_{ii}\))

R functions: cooks.distance and the last plot produced by plot.lm.
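
A minimal sketch of both on the built-in trees data:

model <- lm(Volume ~ Girth, data = trees)
cooks.distance(model)     # one value C_i per individual
plot(model, which = 4)    # Cook's distance plot
plot(model, which = 5)    # residuals vs leverage (last plot of plot.lm)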