\(\newcommand{\VS}{\quad \mathrm{VS} \quad}\) \(\newcommand{\and}{\quad \mathrm{and} \quad}\) \(\newcommand{\E}{\mathbb E}\) \(\newcommand{\P}{\mathbb P}\) \(\newcommand{\Var}{\mathbb V}\) \(\newcommand{\1}{\mathbf 1}\)
Let \(\mathbf{1}\) be the column vector of ones in \(\mathbb R^{n\times 1}\).
If \(\mathbf{1} \in [X]\) (e.g., if the model contains an intercept) \[ \underbrace{\|Y-\overline Y \1\|^2}_{SST} = \underbrace{\|Y-\widehat Y\|^2}_{SSR}+\underbrace{\|\widehat Y-\overline Y \1\|^2}_{SSE}\]
In the general case,
\[\|Y\|^2 = \|Y-\widehat Y\|^2 + \|\widehat Y\|^2\]
The model fits well if the sum of squares of residuals \(SSR = \|Y-\widehat Y\|^2\) is small relative to \(\|Y\|^2\) (equivalently to \(SST\) when there is an intercept)
\(R^2\) if \(\1 \in [X]\)
\[R^2 = \frac{SSE}{SST} = 1-\frac{SSR}{SST}\]
\(R^2\) if \(\1 \not \in [X]\)
\[R^2 = \frac{\|\widehat Y\|^2}{\|Y\|^2} = 1 - \frac{SSR}{\|Y\|^2}\]
Main flaw of \(R^2\): adding a new variable always increases \(R^2\) (because \([X]\) becomes a bigger projection space)
Adjusted \(R^2\) if \(\1 \in [X]\)
\[R^2_a = 1-\frac{n-1}{n-p}\frac{SSR}{SST}\]
Adjusted \(R^2\) if \(\1 \not \in [X]\)
\[R^2_a = 1 - \frac{n}{n-p}\frac{SSR}{\|Y\|^2}\]
With a new variable, \(SSR\) decreases but \(p \to p+1\)
Hence \(R_a^2\) increases when adding a new variable only if that variable reduces the residual sum of squares enough to offset the change \(p \to p+1\).
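A minimal sketch of this trade-off on simulated data (all variable names are made up for illustration): adding a pure-noise predictor mechanically increases \(R^2\), while \(R^2_a\) typically decreases.

```r
## Simulated illustration: a pure-noise predictor increases R^2
## but usually decreases the adjusted R^2.
set.seed(1)
n <- 50
x1 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
noise <- rnorm(n)                                  # irrelevant predictor
m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + noise)
c(summary(m1)$r.squared,     summary(m2)$r.squared)      # R^2 increases
c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared)  # R^2_a typically decreases
```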
```
Residuals:
   Min     1Q Median     3Q    Max 
-8.065 -3.107  0.152  3.495  9.587 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353,    Adjusted R-squared:  0.9331
```
Here, \(R^2=0.9353\) and \(R^2_a=0.9331\).
\(\approx 93\%\) of the variability is explained by the model.
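The numbers above are consistent with a simple regression of Volume on Girth; a minimal sketch, assuming the output comes from R's built-in trees data set (\(n = 31\)):

```r
## Sketch (assumption: trees data): reproduce the summary shown above.
model <- lm(Volume ~ Girth, data = trees)
summary(model)    # coefficients, residual standard error, R^2, adjusted R^2
```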
We want to test q linear constraints on the coefficient vector \(\beta \in \mathbb R^p\).
This is formulated as: \[H_0: R\beta = 0 \quad \text{vs} \quad H_1: R\beta \neq 0\]
where R is a (q × p) constraint matrix encoding the restrictions, with \(q \leq p\).
Theorem
If \(\text{rank}(X) = p\), \(\text{rank}(R)=q\) and \(\varepsilon \sim N(0, \sigma^2 I_n)\), then under \(H_0: R\beta = 0\):
\[F = \frac{n-p}{q} \cdot \frac{{SSR}_c - {SSR}}{{SSR}} \sim F(q, n-p)\]
where \(F(q, n-p)\) denotes the Fisher distribution with \((q, n-p)\) degrees of freedom.
Elements of proof:
Key argument: \(SSR_c - SSR = \|P_{V}Y\|^2\), where \(V=(X\,\mathrm{Ker}(R))^{\perp} \cap [X]\) and \(\dim(V)=q\). Under \(H_0\), \(X\beta \in X\,\mathrm{Ker}(R)\), so \(P_V Y = P_V \varepsilon\) and \(\|P_V Y\|^2 \sim \sigma^2\chi^2(q)\), independently of \(SSR = \|P_{[X]^{\perp}}\varepsilon\|^2 \sim \sigma^2\chi^2(n-p)\).
Therefore, the critical region at significance level \(\alpha\) for testing \(H_0: R\beta = 0\) against \(H_1: R\beta \neq 0\) is:
\[RC_\alpha = \{F > f_{q,n-p}(1-\alpha)\}\]
where \(f_{q,n-p}(1-\alpha)\) denotes the \((1-\alpha)\)-quantile of an \(F(q, n-p)\) distribution.
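A minimal sketch of this test on simulated data (hypothetical variable names): the constrained model is fitted under \(H_0\), the \(F\) statistic is computed from the two residual sums of squares and compared with the Fisher quantile.

```r
## Simulated sketch: test H0: beta_2 = beta_3 = 0 (q = 2 constraints).
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)              # x2, x3 truly inactive
m_full <- lm(y ~ x1 + x2 + x3, data = d)    # unconstrained model
m_sub  <- lm(y ~ x1, data = d)              # constrained model (under H0)
SSR   <- sum(residuals(m_full)^2)
SSR_c <- sum(residuals(m_sub)^2)
p <- length(coef(m_full)); q <- 2
F_stat <- (n - p) / q * (SSR_c - SSR) / SSR
F_stat > qf(0.95, q, n - p)                 # inside the critical region at level 5%?
pf(F_stat, q, n - p, lower.tail = FALSE)    # p-value
```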
Fix some variable \(j\) and consider
\(H_0: \beta_j=0\) VS \(H_1:\beta_j\neq 0\)
Only one constraint: \(q=1\), so that
\[ F = (n-p)\, \frac{SSR_c-SSR}{SSR} \sim F(1,n-p),\] which is the distribution of \(T^2\) for \(T \sim St(n-p)\).
In fact, here \(F = \big(\tfrac{\hat \beta_j}{\hat \sigma_{\hat \beta_j}}\big)^2\), so this Fisher test is equivalent to the Student test presented before.
\(H_0: \beta_2= \dots = \beta_p=0\) VS \(H_1\): at least one of these coefficients is nonzero
Here \(q = p-1\), so that
\[F = \frac{n-p}{p-1}\frac{SSE}{SSR} = \frac{n-p}{p-1}\frac{R^2}{1-R^2} \sim F(p-1, n-p)\]
\(H_0: \beta_{p-q+1}= \dots = \beta_p=0\) VS \(H_1\): at least one of these coefficients is nonzero
\[F = \frac{n-p}{q}\frac{SSR_c - SSR}{SSR} \sim F(q, n-p)\]
Interpretation:
If \(F \geq f_{q,n-p}(1-\alpha)\) (the \((1-\alpha)\)-quantile of the Fisher distribution), the constraints are rejected: we do not accept the submodel and keep the larger model.
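In practice, such constrained-vs-full comparisons can be run with anova() on nested models, or by stating the constraints \(R\beta = 0\) explicitly with linearHypothesis() from the car package. A sketch, reusing the hypothetical m_sub and m_full from the simulated sketch above:

```r
## Sketch: the same nested-model F test with standard R tools.
anova(m_sub, m_full)                              # F, df = (q, n - p), p-value
library(car)
linearHypothesis(m_full, c("x2 = 0", "x3 = 0"))   # H0: beta_2 = beta_3 = 0
```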
```
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353,    Adjusted R-squared:  0.9331 
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16
```
Here, the global Fisher significance test statistic is \(F=419.4\), with \(q=1\) and \(n-p=29\) (\(n=31\), \(p=2\)). The p-value is negligible: the model is globally significant.
Since \(q=1\), the global Fisher test coincides with the Student test on the Girth coefficient: \(F = 419.4 \approx (20.48)^2\).
The linear regression model relies on the following key assumptions: linearity (\(\mathbb{E}(Y) = X\beta\)), identifiability (\(\text{rank}(X) = p\)), homoscedasticity (\(\Var(\varepsilon_i)=\sigma^2\) for all \(i\)), non-correlation of the errors (\(\Var(\varepsilon)\) diagonal) and, for exact tests, normality (\(\varepsilon \sim N(0,\sigma^2 I_n)\)).
Now: diagnostic tools to verify each assumption and remedial strategies when they fail.
The linearity assumption \(\mathbb{E}(Y) = X\beta\) is the fundamental hypothesis of linear regression.
Diagnostic Tools: plot the residuals \(\hat\varepsilon\) against the fitted values \(\widehat Y\) and against each predictor; a systematic pattern suggests a missing nonlinear effect (see the residual analysis below).
Remedial Strategies: transform the response or the predictors, or add nonlinear terms (e.g. polynomials, interactions).
The condition \(\text{rank}(X) = p\) ensures no predictor \(X^{(j)}\) is a linear combination of others.
Why it matters: if \(\text{rank}(X) < p\), the matrix \(X^\top X\) is not invertible and \(\hat\beta\) is not uniquely defined (the model is not identifiable).
In practice:
When a variable is highly correlated with others (correlation close to but not exactly \(\pm 1\)):
Mathematical consequences: \(X^\top X\) is close to singular, so \((X^\top X)^{-1}\) has very large entries.
Statistical implications: the variances \(\Var(\hat\beta_j) = \sigma^2 [(X^\top X)^{-1}]_{jj}\) are inflated, and the estimates \(\hat\beta_j\) are unstable.
This is undesirable from a statistical point of view
Compute the VIF (Variance Inflation Factor) for each \(X^{(j)}\): \[VIF_j = \frac{1}{1-R_j^2},\] where \(R_j^2\) is the \(R^2\) of the regression of \(X^{(j)}\) on the other predictors.
Properties: \(VIF_j \geq 1\), and it grows as \(X^{(j)}\) becomes better explained by the other predictors; values above roughly 5–10 are commonly taken as a sign of problematic multicollinearity.
R function: vif() from the car package.
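A minimal usage sketch, assuming the trees data again and a model with two predictors:

```r
## Sketch (assumption: trees data): VIF of each predictor.
library(car)
m <- lm(Volume ~ Girth + Height, data = trees)
vif(m)    # one VIF per predictor; large values signal multicollinearity
```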
Important Distinction on Multicollinearity
Recall that \(\hat \varepsilon = Y - \widehat Y = P_{[X]^{\perp}} \varepsilon\)
Graphical Assessment: Visual evaluation of model quality
Homoscedasticity Test: test whether \(\Var(\varepsilon_i) = \sigma^2\) for all \(i\) (constant variance)
Non-correlation Test: test whether \(\Var(\varepsilon)\) is diagonal (uncorrelated errors)
Normality Test: Examine normality of residuals
The scatter plot between \(\hat{Y}\) and \(\hat{\varepsilon}\) is informative.
Since \(\text{Cov}(\hat{\varepsilon}, \hat{Y}) = 0\), no structure should appear.
If patterns emerge, this may indicate a violation of linearity (\(\mathbb{E}(Y)=X\beta\)) or of homoscedasticity.
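A minimal sketch of this plot, assuming the hypothetical trees regression used above:

```r
## Sketch: residuals against fitted values; no structure should appear.
model <- lm(Volume ~ Girth, data = trees)        # assumption: trees data
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# plot(model, which = 1)                         # equivalent built-in diagnostic plot
```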
We want to test whether \(\Var(\varepsilon_i)=\sigma^2, ~~\forall i\)
Principle: Assume \(\varepsilon_i\) has variance \(\sigma_i^2 = \sigma^2 + z_i^T\gamma\), where \(z_i\) is a vector of known explanatory variables (typically the covariates of individual \(i\)) and \(\gamma\) is an unknown parameter vector.
\(H_0: \gamma = 0\) (homosced.) VS \(H_1: \gamma \neq 0\) (heterosced.)
R function: bptest from the lmtest package.
What happens: the squared residuals \(\hat\varepsilon_i^2\) are regressed on the \(z_i\), and one tests whether this auxiliary regression is significant; under \(H_0\) the statistic is asymptotically \(\chi^2\) with \(\dim(\gamma)\) degrees of freedom.
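A minimal usage sketch on the hypothetical model used above; by default bptest() takes \(z_i\) to be the model's own covariates.

```r
## Sketch: Breusch-Pagan test of H0: gamma = 0 (homoscedasticity).
library(lmtest)
model <- lm(Volume ~ Girth, data = trees)   # assumption: trees data
bptest(model)                               # small p-value => evidence of heteroscedasticity
```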
Purpose: Test if \(\Var(\varepsilon)\) is diagonal (uncorrelated errors)
Correlation between \(\varepsilon_i\) often occurs with temporal data (index \(i\) represents time)
Auto-correlation Model of order \(r\):
\[\varepsilon_i = \rho_1 \varepsilon_{i-1} + \dots + \rho_r \varepsilon_{i-r} + \eta_i\] where \(\eta_i \sim \text{iid } N(0, \sigma^2)\)
In the auto-correlation model \(\varepsilon_i = \rho_1 \varepsilon_{i-1} + \dots + \rho_r \varepsilon_{i-r} + \eta_i\), we test \(H_0: \rho_1 = \dots = \rho_r = 0\) (no autocorrelation) VS \(H_1\): at least one \(\rho_k \neq 0\).
Durbin-Watson Test (for \(r = 1\) only): R function dwtest from the lmtest package.
Breusch-Godfrey Test (for any \(r\)): R function bgtest from the lmtest package.
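A minimal usage sketch of both tests (the hypothetical trees model below is not really temporal data; it only illustrates the calls):

```r
## Sketch: autocorrelation tests on the residuals (lmtest package).
library(lmtest)
model <- lm(Volume ~ Girth, data = trees)   # assumption: trees data
dwtest(model)                               # Durbin-Watson, r = 1 only
bgtest(model, order = 3)                    # Breusch-Godfrey, up to order r = 3
```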
Solutions:
Purpose: Examine normality of residuals \(\hat \varepsilon\)
Reminder on Normality Assumption: normality of \(\varepsilon\) is not needed for \(\hat\beta\) to be unbiased; it is what guarantees the exact Student and Fisher distributions of the test statistics.
Why examine it anyway? For small \(n\), the tests and confidence intervals above rely on it.
If \(\varepsilon \sim N(0, \sigma^2 I_n)\) then \(\hat{\varepsilon} \sim N(0, \sigma^2 P_{[X]^{\perp}})\)
Q-Q Plot (Henry's line): R function qqnorm.
Shapiro-Wilk, \(\chi^2\) or Kolmogorov-Smirnov Tests: R function shapiro.test (Shapiro-Wilk).
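A minimal sketch of both diagnostics on the hypothetical model used above:

```r
## Sketch: normality diagnostics on the residuals.
model <- lm(Volume ~ Girth, data = trees)            # assumption: trees data
qqnorm(residuals(model)); qqline(residuals(model))   # Q-Q plot (Henry's line)
shapiro.test(residuals(model))                       # Shapiro-Wilk test
```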
An individual is atypical when it is poorly explained by the model (abnormally large residual) or when it has a strong influence on the fit (leverage point, influential point).
Identify these individuals to check for possible data errors and to assess the robustness of the fitted model.
Individual \(i\) is poorly explained if its residual \(\hat{\varepsilon}_i\) is “abnormally” large.
How to quantify “abnormally”?
Let \(h_{ij}\) be elements of matrix \(P_{[X]}\) (hat matrix).
For a Gaussian model: \(\hat{\varepsilon}_i \sim N(0, (1-h_{ii})\sigma^2)\)
Standardized Residuals
\[t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\]
We expect \(t_i \sim St(n-p)\) (not strictly true since \(\hat{\varepsilon}_i \not\perp \hat{\sigma}^2\))
Definition
Individual \(i\) is considered poorly explained by the model if: \[|t_i| > t_{n-p}(1-\alpha/2)\] for predetermined \(\alpha\), typically \(\alpha = 0.05\), giving \(t_{n-p}(1-\alpha/2) \approx 2\).
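A minimal sketch of this rule on the hypothetical model used above; rstandard() returns the standardized residuals \(t_i\) defined here:

```r
## Sketch: standardized residuals and the |t_i| > 2 rule.
model <- lm(Volume ~ Girth, data = trees)   # assumption: trees data
t_i <- rstandard(model)                     # hat(eps)_i / (hat(sigma) * sqrt(1 - h_ii))
which(abs(t_i) > 2)                         # individuals poorly explained by the model
```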
A point is influential if it contributes significantly to \(\hat{\beta}\) estimation.
Leverage value: \(h_{ii}\) corresponds to the weight of \(Y_i\) on its own estimation \(\hat{Y}_i\)
We know that: \(\sum_{i=1}^n h_{ii} = \text{tr}(P_{[X]}) = p\)
Therefore, on average: \(h_{ii} \approx p/n\)
Definition
Individual \(i\) is called a leverage point if \(h_{ii} \gg p/n\)
Typically: \(h_{ii} > 2p/n\) or \(h_{ii} > 3p/n\)
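A minimal sketch of the leverage diagnostic, same hypothetical model as above:

```r
## Sketch: leverage values h_ii and the 2p/n rule of thumb.
model <- lm(Volume ~ Girth, data = trees)   # assumption: trees data
h <- hatvalues(model)                       # diagonal of the hat matrix P_[X]
p <- length(coef(model)); n <- nobs(model)
which(h > 2 * p / n)                        # candidate leverage points
```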
Cook’s Distance
Quantifies the influence of individual \(i\) on \(\hat{Y}\):
\[C_i = \frac{\|\hat{Y} - \hat{Y}_{(-i)}\|^2}{p\hat{\sigma}^2}\]
where \(\hat{Y}_{(-i)} = X\hat{\beta}_{(-i)}\) with \(\hat{\beta}_{(-i)}\): estimation of \(\beta\) without individual \(i\)
\[C_i = \frac{1}{p} \cdot \frac{h_{ii}}{1-h_{ii}} \cdot t_i^2,\]
where \(t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\).
This formula shows that Cook's distance \(C_i\) combines the leverage of individual \(i\) (through \(h_{ii}/(1-h_{ii})\)) and the size of its standardized residual \(t_i^2\): a point is influential if it has high leverage, a large residual, or both.
R functions: cooks.distance, and the last plot of plot.lm.
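A minimal usage sketch, same hypothetical model as above:

```r
## Sketch: Cook's distances and the associated diagnostic plots.
model <- lm(Volume ~ Girth, data = trees)   # assumption: trees data
cooks.distance(model)                       # C_i for each individual
plot(model, which = 4)                      # Cook's distance per observation
plot(model, which = 5)                      # residuals vs leverage (last default plot)
```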