Linear Regression Model

AI was used to assist with the formatting and writing of the proofs on this page.

Gauss-Markov

\(\newcommand{\VS}{\quad \mathrm{VS} \quad}\) \(\newcommand{\and}{\quad \mathrm{and} \quad}\) \(\newcommand{\E}{\mathbb E}\) \(\newcommand{\P}{\mathbb P}\) \(\newcommand{\Var}{\mathbb V}\)

Gauss-Markov

Under the standard linear model assumptions (\(Y = X\beta + \varepsilon\) with \(\E[\varepsilon] = 0\) and \(\Var(\varepsilon) = \sigma^2 I\)), if \(\tilde \beta\) is another linear unbiased estimator of \(\beta\), then \[\mathbb V(\hat \beta) \preceq \mathbb V(\tilde \beta),\]

where \(A\preceq B\) means that \(B-A\) is a symmetric positive semidefinite matrix.

Proof of the Gauss-Markov Theorem

Setup

Let \(\tilde{\beta} = CY\) be any linear unbiased estimator of \(\beta\), where \(C\) is a \(p \times n\) matrix of constants.

Step 1: Unbiasedness Constraint

Since \(\tilde{\beta}\) is unbiased: \(\E[\tilde{\beta}] = \beta\) for all \(\beta\)

\[\E[\tilde{\beta}] = \E[CY] = \E[C(X\beta + \varepsilon)]\]

\[= CX\beta + C\E[\varepsilon] = CX\beta\]

For unbiasedness: \(CX\beta = \beta\) for all \(\beta\)

Therefore: \(CX = I\) (the \(p \times p\) identity matrix)

Step 2: Express Any Linear Unbiased Estimator

Since \(CX = I\), we can write: \[C = (X^TX)^{-1}X^T + D,\]

where \(D = C - (X^TX)^{-1}X^T\) satisfies \(DX = CX - (X^TX)^{-1}X^TX = I - I = 0\).

Verification: \(CX = (X^TX)^{-1}X^TX + DX = I + 0 = I\)

Step 3: Express the Estimator

\[\tilde{\beta} = CY = [(X^TX)^{-1}X^T + D]Y\]

\[= (X^TX)^{-1}X^TY + DY\]

\[= \hat{\beta} + DY\]

Step 4: Calculate Variance

\[\text{Var}(\tilde{\beta}) = \text{Var}(\hat{\beta} + DY)\]

\[= \text{Var}(\hat{\beta}) + \text{Var}(DY) + \text{Cov}(\hat{\beta}, DY) + \text{Cov}(DY, \hat{\beta})\]

Step 5: Show the Covariance Terms Are Zero

\[\text{Cov}(\hat{\beta}, DY) = \text{Cov}((X^TX)^{-1}X^TY, DY)\]

\[= (X^TX)^{-1}X^T \text{Cov}(Y, Y) D^T\]

\[= (X^TX)^{-1}X^T (\sigma^2I) D^T\]

\[= \sigma^2(X^TX)^{-1}X^TD^T\]

Since \(DX = 0\), we have \(X^TD^T = (DX)^T = 0\), therefore: \[\text{Cov}(\hat{\beta}, DY) = 0 = \text{Cov}(DY, \hat{\beta})\]

Step 6: Final Comparison

\[\text{Var}(\tilde{\beta}) = \text{Var}(\hat{\beta}) + \text{Var}(DY)\]

\[= \sigma^2(X^TX)^{-1} + \sigma^2DD^T\]

Since \(DD^T \succeq 0\) (positive semidefinite), we have:

\[\text{Var}(\tilde{\beta}) - \text{Var}(\hat{\beta}) = \sigma^2DD^T \succeq 0\]

Conclusion

This proves that \(\text{Var}(\hat{\beta}) \preceq \text{Var}(\tilde{\beta})\) in the matrix sense, establishing that the OLS estimator \(\hat{\beta}\) has minimum variance among all linear unbiased estimators. □
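
The comparison can be illustrated numerically. The sketch below is not part of the proof: the design matrix, dimensions, and \(\sigma^2\) are illustrative assumptions. It builds an arbitrary \(D\) with \(DX = 0\), forms the corresponding linear unbiased estimator \(C = (X^TX)^{-1}X^T + D\), and checks that the difference of the two covariance matrices, \(\sigma^2 DD^T\), has only nonnegative eigenvalues.

```python
# Minimal numerical sketch of the Gauss-Markov comparison (illustrative
# values, not from the text): any C with CX = I decomposes as
# C = (X^T X)^{-1} X^T + D with DX = 0, and Var(beta_tilde) - Var(beta_hat)
# = sigma^2 D D^T is positive semidefinite.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 50, 3, 2.0
X = rng.normal(size=(n, p))

XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)      # (X^T X)^{-1} X^T, the OLS map

# Any D with DX = 0: multiply an arbitrary p x n matrix by the
# residual-maker M = I - X (X^T X)^{-1} X^T, which satisfies MX = 0.
M = np.eye(n) - X @ XtX_inv_Xt
D = rng.normal(size=(p, n)) @ M
assert np.allclose(D @ X, 0)

C = XtX_inv_Xt + D                              # alternative linear unbiased estimator
assert np.allclose(C @ X, np.eye(p))            # unbiasedness constraint CX = I

var_ols = sigma2 * np.linalg.inv(X.T @ X)       # sigma^2 (X^T X)^{-1}
var_alt = sigma2 * C @ C.T                      # sigma^2 C C^T
diff = var_alt - var_ols                        # equals sigma^2 D D^T
print(np.linalg.eigvalsh(diff))                 # all eigenvalues >= 0 (up to rounding)
```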

Maximum Likelihood Estimators for Linear Regression

Maximum Likelihood Estimator

MLE

Let \(\hat \beta_{MLE}\) and \(\hat \sigma_{MLE}^2\) be the MLEs of \(\beta\) and \(\sigma^2\), respectively.

  • \(\hat{\beta}_{MLE} = \hat{\beta}\) and \(\hat{\sigma}^2_{MLE} = \frac{SSR}{n} = \frac{n-p}{n} \hat{\sigma}^2\).

  • \(\hat{\beta} \sim N(\beta, \sigma^2(X^TX)^{-1})\).

  • \(\frac{n-p}{\sigma^2} \hat{\sigma}^2 = \frac{n}{\sigma^2} \hat{\sigma}^2_{MLE} \sim \chi^2(n - p)\).

  • \(\hat{\beta}\) and \(\hat{\sigma}^2\) are independent.

Proof

Setup

Model: \(Y = X\beta + \varepsilon\) where \(\varepsilon \sim N(0, \sigma^2 I)\)

This means: \(Y \sim N(X\beta, \sigma^2 I)\)

Likelihood Function

For \(n\) observations, the likelihood function is:

\[L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} (Y - X\beta)^T(Y - X\beta)\right)\]

Log-Likelihood Function

\[\ell(\beta, \sigma^2) = \log L(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (Y - X\beta)^T(Y - X\beta)\]

\[\ell(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} (Y - X\beta)^T(Y - X\beta)\]

Finding MLE for \(\beta\)

Taking the partial derivative with respect to \(\beta\):

\[\frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2} X^T(Y - X\beta)\]

Setting equal to zero: \[X^T(Y - X\hat{\beta}_{MLE}) = 0\]

\[X^TY - X^TX\hat{\beta}_{MLE} = 0\]

\[X^TX\hat{\beta}_{MLE} = X^TY\]

Therefore: \[\hat{\beta}_{MLE} = (X^TX)^{-1}X^TY\]

Finding MLE for \(\sigma^2\)

Taking the partial derivative with respect to \(\sigma^2\):

\[\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} (Y - X\beta)^T(Y - X\beta)\]

Setting equal to zero: \[-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} (Y - X\hat{\beta}_{MLE})^T(Y - X\hat{\beta}_{MLE}) = 0\]

Multiplying by \(2\sigma^4\): \[-n\sigma^2 + (Y - X\hat{\beta}_{MLE})^T(Y - X\hat{\beta}_{MLE}) = 0\]

Therefore: \[\hat{\sigma}^2_{MLE} = \frac{1}{n} (Y - X\hat{\beta}_{MLE})^T(Y - X\hat{\beta}_{MLE}) = \frac{SSR}{n}\]
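
As a sanity check, the closed forms above can be compared with a direct numerical maximization of the log-likelihood on simulated data. The sketch below is only an illustration: the design matrix, true parameters, and the use of SciPy's general-purpose BFGS minimizer (on the negative log-likelihood, parametrized with \(\log \sigma^2\) to keep the variance positive) are assumptions made for the example.

```python
# Compare the closed-form MLEs with a numerical maximization of the
# log-likelihood on simulated data (illustrative sizes and parameters).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true, sigma_true = np.array([1.0, -2.0, 0.5]), 1.3
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

# Closed forms from the derivation
beta_mle = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_mle = np.sum((y - X @ beta_mle) ** 2) / n      # SSR / n

def neg_loglik(theta):
    beta, log_s2 = theta[:p], theta[p]
    s2 = np.exp(log_s2)                               # log-parametrization keeps s2 > 0
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * s2) + 0.5 * resid @ resid / s2

opt = minimize(neg_loglik, x0=np.zeros(p + 1), method="BFGS")
print(beta_mle, opt.x[:p])            # should agree to optimizer tolerance
print(sigma2_mle, np.exp(opt.x[p]))
```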

Verification (Second-Order Conditions)

The Hessian matrix has:

\[\frac{\partial^2 \ell}{\partial \beta \partial \beta^T} = -\frac{1}{\sigma^2} X^TX\]

This is negative definite (assuming \(X\) has full column rank, so that \(X^TX\) is positive definite), confirming \(\hat{\beta}_{MLE}\) is a maximum.

\[\frac{\partial^2 \ell}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6} (Y - X\beta)^T(Y - X\beta)\]

At the MLE, where \((Y - X\hat{\beta}_{MLE})^T(Y - X\hat{\beta}_{MLE}) = n\hat{\sigma}^2_{MLE}\): \(\frac{\partial^2 \ell}{\partial (\sigma^2)^2}\bigg|_{\hat{\sigma}^2_{MLE}} = \frac{n}{2\hat{\sigma}^4_{MLE}} - \frac{n}{\hat{\sigma}^4_{MLE}} = -\frac{n}{2\hat{\sigma}^4_{MLE}} < 0\)

This confirms \(\hat{\sigma}^2_{MLE}\) is a maximum.
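
The second-order conditions can also be spot-checked numerically: on simulated data (illustrative sizes and parameters, an assumption for the example), a finite-difference Hessian of the log-likelihood at \((\hat\beta_{MLE}, \hat\sigma^2_{MLE})\) should be negative definite.

```python
# Finite-difference Hessian of the log-likelihood at the MLE; all
# eigenvalues should come out negative (simulated, illustrative data).
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -1.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / n
theta_hat = np.append(beta_hat, sigma2_hat)           # (beta_hat, sigma2_hat)

def loglik(theta):
    beta, s2 = theta[:p], theta[p]
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * resid @ resid / s2

# Central finite-difference Hessian
h, k = 1e-5, p + 1
H = np.empty((k, k))
for i in range(k):
    for j in range(k):
        e_i, e_j = np.eye(k)[i] * h, np.eye(k)[j] * h
        H[i, j] = (loglik(theta_hat + e_i + e_j) - loglik(theta_hat + e_i - e_j)
                   - loglik(theta_hat - e_i + e_j) + loglik(theta_hat - e_i - e_j)) / (4 * h * h)

print(np.linalg.eigvalsh(H))   # all eigenvalues should be negative
```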

Key Properties

  • Consistency: Both estimators are consistent
  • Bias: \(\hat{\beta}_{MLE}\) is unbiased, but \(\hat{\sigma}^2_{MLE}\) is biased (it divides \(SSR\) by \(n\) instead of \(n-p\)); see the simulation sketch after this list
  • Efficiency: Under normality, \(\hat{\beta}_{MLE}\) attains the Cramér-Rao lower bound; \(\hat{\sigma}^2_{MLE}\) is only asymptotically efficient
  • Relationship to OLS: \(\hat{\beta}_{MLE} = \hat{\beta}_{OLS}\) under the normality assumption
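
The bias of \(\hat{\sigma}^2_{MLE}\) can be seen in a short Monte Carlo experiment. In the sketch below (simulated data with illustrative \(n\), \(p\), \(\sigma^2\), all assumptions for the example), \(SSR/n\) concentrates around \(\frac{n-p}{n}\sigma^2\), while the unbiased estimator \(SSR/(n-p)\) concentrates around \(\sigma^2\).

```python
# Monte Carlo illustration of the bias of SSR/n versus SSR/(n-p).
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2, reps = 30, 4, 2.0, 20_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix; SSR = Y^T (I - H) Y

# beta = 0 without loss of generality: (I - H) X beta = 0, so SSR depends only on the errors
Y = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
ssr = np.sum((Y - Y @ H) ** 2, axis=1)

print(np.mean(ssr / n), (n - p) / n * sigma2)   # biased MLE vs its expectation
print(np.mean(ssr / (n - p)), sigma2)           # unbiased estimator vs sigma^2
```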

\(\hat \beta\) is an Efficient Estimator in the Gaussian Model

Theorem

In the Gaussian model, \(\hat \beta\) is an efficient estimator of \(\beta\). This means that \[ \Var(\hat \beta) \preceq \Var(\tilde \beta)\; , \] for any unbiased estimator \(\tilde \beta\).

Setup

Consider the linear regression model: \[Y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I)\]

We want to prove that \(\hat \beta = (X^TX)^{-1}X^TY\) is efficient.

Definition of Efficiency

An unbiased estimator \(\hat \beta\) is efficient if its variance attains the Cramér-Rao lower bound: \[\text{Var}(\hat \beta) = [I(\beta)]^{-1},\] where \(I(\beta)\) is the Fisher information matrix.

Step 1: Fisher Information Matrix

The log-likelihood function is: \[\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(Y - X\beta)^T(Y - X\beta)\]

First derivative with respect to \(\beta\): \[\frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2}X^T(Y - X\beta)\]

Second derivative: \[\frac{\partial^2 \ell}{\partial \beta \partial \beta^T} = -\frac{1}{\sigma^2}X^TX\]

Fisher Information Matrix for \(\beta\): \[I(\beta) = -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial \beta \partial \beta^T}\right] = \frac{1}{\sigma^2}X^TX\]

Cramér-Rao lower bound: \[[I(\beta)]^{-1} = \sigma^2(X^TX)^{-1}\]
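
The information identity used here, \(I(\beta) = \Var\!\left(\frac{\partial \ell}{\partial \beta}\right) = -\E\left[\frac{\partial^2 \ell}{\partial \beta \partial \beta^T}\right]\), can be checked by simulation: the covariance of the score \(\frac{1}{\sigma^2}X^T(Y - X\beta)\) over repeated samples should match \(\frac{1}{\sigma^2}X^TX\). The sketch below uses an illustrative design matrix and parameter values, chosen only for the example.

```python
# Monte Carlo check that the covariance of the score equals X^T X / sigma^2.
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 40, 3, 1.5
X = rng.normal(size=(n, p))

reps = 50_000
eps = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))   # errors for `reps` independent samples
scores = eps @ X / sigma2                                  # each row is (1/sigma^2) X^T eps, the score at the true beta

print(np.cov(scores, rowvar=False))   # Monte Carlo estimate of I(beta) = Var(score)
print(X.T @ X / sigma2)               # analytic Fisher information
```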

Step 2: Variance of \(\hat \beta\)

\[\hat \beta = (X^TX)^{-1}X^TY = (X^TX)^{-1}X^T(X\beta + \varepsilon) = \beta + (X^TX)^{-1}X^T\varepsilon\]

Since \(\varepsilon \sim N(0, \sigma^2 I)\): \[\text{Var}(\hat \beta) = \text{Var}((X^TX)^{-1}X^T\varepsilon)\]

\[= (X^TX)^{-1}X^T \cdot \text{Var}(\varepsilon) \cdot X(X^TX)^{-1}\]

\[= (X^TX)^{-1}X^T \cdot \sigma^2 I \cdot X(X^TX)^{-1}\]

\[= \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1}\]

\[= \sigma^2(X^TX)^{-1}\]

Step 3: Verification of Efficiency

We have shown:

  • Cramér-Rao bound: \([I(\beta)]^{-1} = \sigma^2(X^TX)^{-1}\)
  • Variance of \(\hat \beta\): \(\text{Var}(\hat \beta) = \sigma^2(X^TX)^{-1}\)

Since: \[\text{Var}(\hat \beta) = [I(\beta)]^{-1}\]

The estimator \(\hat \beta\) achieves the Cramér-Rao lower bound.
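
A simulation sketch of this statement (with an illustrative design matrix, parameters, and sample sizes, all assumptions for the example): the sampling covariance of \(\hat\beta\) across repeated draws of \(Y\) should match the Cramér-Rao bound \(\sigma^2(X^TX)^{-1}\).

```python
# Empirical covariance of beta_hat over repeated samples versus the bound.
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 60, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([2.0, -1.0, 0.3])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20_000
Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=(reps, n))  # reps independent samples
betas = Y @ X @ XtX_inv                                           # each row is beta_hat for one sample

print(np.cov(betas, rowvar=False))    # empirical Var(beta_hat)
print(sigma2 * XtX_inv)               # Cramer-Rao lower bound sigma^2 (X^T X)^{-1}
```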

Conclusion

Therefore, \(\hat \beta = (X^TX)^{-1}X^TY\) is an efficient estimator of \(\beta\) in the Gaussian linear regression model.

Additional Notes

  • This efficiency holds specifically under the normality assumption
  • \(\hat \beta\) is also the Best Linear Unbiased Estimator (BLUE) by the Gauss-Markov theorem
  • Under normality, \(\hat \beta\) has minimum variance among all unbiased estimators, not just the linear ones, i.e. it is the Best Unbiased Estimator (BUE)