Linear Regression Model

Estimation of \(\hat \beta\)

Formula

By definition of the orthogonal projection, the vector \(Y - X\hat{\beta}\) is orthogonal to any vector in the space \([X]\), in particular to each of the vectors \(X^{(j)}\). Thus, for \(j = 1, \ldots, p\), the scalar product between \(Y - X\hat{\beta}\) and \(X^{(j)}\) is zero, i.e.

\[\left(X^{(j)}\right)^T(Y - X\hat{\beta}) = 0.\]

Since this relation is true for all \(j = 1, \ldots, p\), we have

\[X^T(Y - X\hat{\beta}) = 0,\]

where this time \(0\) denotes the zero vector of size \(p\). This implies \(X^T X\hat{\beta} = X^T Y\). Finally, the hypothesis \(\text{rg}(X) = p\) guarantees that the matrix \(X^T X\) is invertible, and we obtain the result

\[\hat{\beta} = (X^T X)^{-1}X^T Y \enspace .\]

Other possible idea

We get the same equation if we instead use that the gradient of \(\|Y-X\beta\|_2^2\) must be \(0\) at \(\beta=\hat \beta\).
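As a quick sanity check of this closed form, here is a minimal NumPy sketch (the dimensions, the true \(\beta\) and the noise level are arbitrary illustrative choices) comparing \((X^T X)^{-1}X^T Y\) with the output of a generic least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                             # illustrative sizes
X = rng.normal(size=(n, p))               # design matrix, assumed to have rank p
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(scale=0.3, size=n)

# Closed-form OLS estimator: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Same solution obtained by a generic least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```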

Expectation

We compute:

\[\mathbb{E}(\hat{\beta}) = \mathbb{E}((X^T X)^{-1}X^T Y) = (X^T X)^{-1}X^T \mathbb{E}(Y)\]

since \((X^T X)^{-1}X^T\) is a deterministic matrix and expectation is linear. Furthermore

\[\mathbb{E}(Y) = \mathbb{E}(X\beta + \varepsilon) = X\beta + \mathbb{E}(\varepsilon) = X\beta\]

since \(\mathbb{E}(\varepsilon) = 0\). We indeed obtain

\[\mathbb{E}(\hat{\beta}) = (X^T X)^{-1}X^T X\beta = \beta \enspace .\]

Variance

For the variance, let us recall that for any deterministic matrix \(A\) and for any random vector \(Z\), \(\mathbb{V}(AZ) = A\mathbb{V}(Z)A^T\). We thus have

\[\mathbb{V}(\hat{\beta}) = \mathbb{V}((X^T X)^{-1}X^T Y) = (X^T X)^{-1}X^T \mathbb{V}(Y)((X^T X)^{-1}X^T)^T.\]

Now \(\mathbb{V}(Y) = \mathbb{V}(X\beta + \varepsilon) = \mathbb{V}(\varepsilon) = \sigma^2 I_n\) and \(((X^T X)^{-1}X^T)^T = (X^T)^T ((X^T X)^{-1})^T = X(X^T X)^{-1}\), since the matrix \(X^T X\) (and therefore also \((X^T X)^{-1}\)) is symmetric.

We therefore arrive at

\[\mathbb{V}(\hat{\beta}) = (X^T X)^{-1}X^T \sigma^2 I_n X(X^T X)^{-1} = \sigma^2(X^T X)^{-1}X^T X(X^T X)^{-1} = \sigma^2(X^T X)^{-1}.\]

Hence the result:

\[\mathbb{V}(\hat{\beta}) = \sigma^2(X^T X)^{-1}.\]
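A small Monte Carlo sketch can illustrate both identities \(\mathbb{E}(\hat{\beta})=\beta\) and \(\mathbb{V}(\hat{\beta})=\sigma^2(X^TX)^{-1}\): the design is held fixed and only the noise is redrawn. All numerical values below are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 3, 0.5                    # illustrative values
X = rng.normal(size=(n, p))                 # fixed design across replications
beta = np.array([1.0, -2.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)
XtX_inv_Xt = XtX_inv @ X.T

# Redraw the noise many times and recompute beta_hat each time
estimates = np.array([XtX_inv_Xt @ (X @ beta + sigma * rng.normal(size=n))
                      for _ in range(50_000)])

print(estimates.mean(axis=0))               # close to beta = (1, -2, 0.5)
print(np.round(np.cov(estimates, rowvar=False), 4))
print(np.round(sigma**2 * XtX_inv, 4))      # theoretical sigma^2 (X^T X)^{-1}
```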

Convergence of \(\hat \beta\)

For the vector estimator \(\hat{\beta}\), the mean squared error is a scalar defined by: \[\text{MSE}(\hat{\beta}) = \mathbb{E}[\|\hat{\beta} - \beta\|^2]\]

The MSE is the trace of the covariance matrix plus the squared norm of the bias; here the bias is zero since \(\hat{\beta}\) is unbiased, so \(\mathbb{E}[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T] = \mathbb{V}(\hat{\beta})\) and \[\text{MSE}(\hat{\beta}) = \mathbb{E}[\|\hat{\beta} - \beta\|^2] = \mathbb{E}[\text{Tr}((\hat{\beta} - \beta)(\hat{\beta} - \beta)^T)]= \text{Tr}(\mathbb{E}[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T])= \text{Tr}(\mathbb{V}(\hat{\beta})) = \sigma^2 \text{Tr}((X^TX)^{-1})\]

Hence, \(\hat{\beta}\) converges to \(\beta\) in quadratic mean (and therefore in probability) as soon as \(\text{Tr}((X^TX)^{-1})\) goes to \(0\) as \(n \to \infty\).
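For instance, when the rows of \(X\) are drawn i.i.d. with an invertible second-moment matrix (an assumption added here purely for illustration, since the text treats \(X\) as deterministic), \(X^TX\) grows like \(n\) and the trace indeed vanishes. A short sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
p, sigma = 3, 0.5                           # illustrative values
for n in (50, 500, 5_000, 50_000):
    X = rng.normal(size=(n, p))             # rows drawn iid for illustration
    mse = sigma**2 * np.trace(np.linalg.inv(X.T @ X))
    print(n, mse)                           # decays roughly like sigma^2 * p / n
```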

Gauss Markov

\(\newcommand{\VS}{\quad \mathrm{VS} \quad}\) \(\newcommand{\and}{\quad \mathrm{and} \quad}\) \(\newcommand{\E}{\mathbb E}\) \(\newcommand{\P}{\mathbb P}\) \(\newcommand{\Var}{\mathbb V}\)

Note

Under the same assumptions, if \(\tilde \beta\) is another linear and unbiased estimator then \[\mathbb V(\hat \beta) \preceq \mathbb V(\tilde \beta),\]

where \(A\preceq B\) means that \(B-A\) is a symmetric positive semidefinite matrix.

Since \(\tilde \beta\) is a linear estimator, there exists a deterministic matrix \(M\) such that \(\tilde \beta = MY\). Since it is also assumed to be unbiased, we have \(\mathbb{E}(\tilde{\beta}) = \beta\) for all \(\beta\), that is \(M\mathbb{E}(Y) = \beta\), in other words \(MX\beta = \beta\) for all \(\beta\). Since this relation holds for every \(\beta \in \mathbb{R}^p\), it forces \(MX = I_p\).

There are several ways to prove the theorem. One of them consists in starting from the calculation of the covariance matrix between \(\tilde{\beta}\) and \(\hat{\beta}\):

\[\text{Cov}(\tilde{\beta}, \hat{\beta}) = \mathbb{E}((\tilde{\beta} - \mathbb{E}(\tilde{\beta}))(\hat{\beta} - \mathbb{E}(\hat{\beta}))^T).\]

Now \(\tilde{\beta} - \mathbb{E}(\tilde{\beta}) = MY - \beta = M(X\beta + \varepsilon) - \beta = M\varepsilon\) since \(MX\beta = \beta\). Similarly \(\hat{\beta} - \mathbb{E}(\hat{\beta}) = (X^TX)^{-1}X^T\varepsilon\) (this is the case \(M = (X^TX)^{-1}X^T\)). We therefore obtain

\[\text{Cov}(\tilde{\beta}, \hat{\beta}) = \mathbb{E}(M\varepsilon\varepsilon^T((X^TX)^{-1}X^T)^T) = M\mathbb{E}(\varepsilon\varepsilon^T)X(X^TX)^{-1}.\]

Now since \(\varepsilon\) is centered, \(\mathbb{E}(\varepsilon\varepsilon^T) = \mathbb{V}(\varepsilon) = \sigma^2I_n\), and since \(MX = I_p\) we conclude

\[\text{Cov}(\tilde{\beta}, \hat{\beta}) = \sigma^2(X^TX)^{-1}.\]

This means that \(\text{Cov}(\tilde{\beta}, \hat{\beta}) = \mathbb{V}(\hat{\beta})\). We wish at this stage to use the Cauchy-Schwarz inequality, but since this last equality concerns matrices, we must first reduce to scalars. Let \(c\) be any vector of dimension \(p\). From the previous equality we deduce \(c^T\text{Cov}(\tilde{\beta}, \hat{\beta})c = c^T\mathbb{V}(\hat{\beta})c\), that is \(\text{Cov}(c^T\tilde{\beta}, c^T\hat{\beta}) = \mathbb{V}(c^T\hat{\beta})\). This equality concerns scalars. We now apply the Cauchy-Schwarz inequality:

\[\mathbb{V}(c^T\hat{\beta}) = \text{Cov}(c^T\tilde{\beta}, c^T\hat{\beta}) \leq \sqrt{\mathbb{V}(c^T\tilde{\beta})\mathbb{V}(c^T\hat{\beta})}.\]

Dividing both sides by \(\sqrt{\mathbb{V}(c^T\hat{\beta})}\) (the inequality being trivial when this quantity is zero), we arrive at

\[\sqrt{\mathbb{V}(c^T\hat{\beta})} \leq \sqrt{\mathbb{V}(c^T\tilde{\beta})},\]

and therefore \(\mathbb{V}(c^T\hat{\beta}) \leq \mathbb{V}(c^T\tilde{\beta})\). Since this inequality is valid for any vector \(c\), this means that \(\mathbb{V}(\tilde{\beta}) - \mathbb{V}(\hat{\beta})\) is positive semi-definite, which is what we wanted to show.
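To see the theorem at work numerically, one can compare the OLS covariance with that of another linear unbiased estimator. The sketch below uses the competitor \(\tilde\beta = (X^TWX)^{-1}X^TWY\) for an arbitrary positive diagonal \(W\) (my own illustrative choice, which still satisfies \(MX = I_p\)) and checks that the eigenvalues of \(\mathbb V(\tilde\beta)-\mathbb V(\hat\beta)\) are nonnegative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 40, 3, 1.0                 # illustrative values
X = rng.normal(size=(n, p))

# OLS: M_hat = (X^T X)^{-1} X^T ; competitor: M_tilde = (X^T W X)^{-1} X^T W,
# which also satisfies M_tilde X = I_p, hence is linear and unbiased.
W = np.diag(rng.uniform(0.5, 2.0, size=n))
M_hat = np.linalg.solve(X.T @ X, X.T)
M_tilde = np.linalg.solve(X.T @ W @ X, X.T @ W)

V_hat = sigma**2 * M_hat @ M_hat.T       # = sigma^2 (X^T X)^{-1}
V_tilde = sigma**2 * M_tilde @ M_tilde.T

# Gauss-Markov: V_tilde - V_hat should be positive semidefinite
eigvals = np.linalg.eigvalsh(V_tilde - V_hat)
print(np.round(eigvals, 6))              # all >= 0 (up to numerical error)
```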

Estimation of \(\hat \sigma^2\)

Expectation (unbiasedness)

First, let’s verify that \(\hat{\sigma}^2\) is unbiased:

\[\hat{\varepsilon} = P_{[X]^\perp}Y = P_{[X]^\perp}(X\beta + \varepsilon) = P_{[X]^\perp}\varepsilon\]

since \(P_{[X]^\perp}X = 0\).

Therefore: \[\|\hat{\varepsilon}\|^2 = \hat{\varepsilon}^T\hat{\varepsilon} = \varepsilon^T P_{[X]^\perp}^T P_{[X]^\perp}\varepsilon = \varepsilon^T P_{[X]^\perp}\varepsilon\]

since \(P_{[X]^\perp}\) is symmetric and idempotent. Moreover, \(\mathbb{E}[\varepsilon^T A \varepsilon] = \text{tr}(A\,\mathbb{V}(\varepsilon)) = \sigma^2\,\text{tr}(A)\) for any deterministic matrix \(A\) (because \(\mathbb{E}(\varepsilon) = 0\)), and \(\text{tr}(P_{[X]^\perp}) = n - p\), so:

\[\mathbb{E}[\|\hat{\varepsilon}\|^2] = \mathbb{E}[\varepsilon^T P_{[X]^\perp}\varepsilon] = \sigma^2 \text{tr}(P_{[X]^\perp}) = \sigma^2(n-p)\]

Thus:

\[\mathbb{E}[\hat{\sigma}^2] = \frac{1}{n-p}\mathbb{E}[\|\hat{\varepsilon}\|^2] = \sigma^2\]
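A brief simulation sketch of these two facts (the trace of the projector and the unbiasedness of \(\hat\sigma^2\)); the dimensions and noise level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 30, 4, 0.7                                 # illustrative values
X = rng.normal(size=(n, p))
P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # projector onto [X]^perp
print(round(np.trace(P_perp), 6))                        # equals n - p = 26

beta = rng.normal(size=p)
Y = X @ beta + sigma * rng.normal(size=(50_000, n))      # many independent noise draws
residuals = Y @ P_perp.T                                 # hat(eps) for each draw
sigma2_hat = np.sum(residuals**2, axis=1) / (n - p)
print(sigma2_hat.mean(), sigma**2)                       # the two values should be close
```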

Variance

Setup: Consider the linear regression model \(Y = X\beta + \varepsilon\), and recall that:

  • \(\hat{\varepsilon} = P_{[X]^\perp}\varepsilon\) with \(P_{[X]^\perp} = I - X(X^TX)^{-1}X^T\)
  • \(P_{[X]^\perp}\) is symmetric and idempotent with \(\text{tr}(P_{[X]^\perp}) = n - p\)

The estimator is: \(\hat{\sigma}^2 = \frac{1}{n-p}\|\hat{\varepsilon}\|^2 = \frac{1}{n-p}\varepsilon^T P_{[X]^\perp}\varepsilon\)

Variance in the Gaussian case (simpler proof)

Let us first assume that \(\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)\).

Key Observation: Since \(P_{[X]^\perp}\) is idempotent with rank \(n-p\), its eigenvalues are \((n-p)\) ones and \(p\) zeros. By spectral decomposition: \[P_{[X]^\perp} = Q\Lambda Q^T\] where \(Q\) is orthogonal and \(\Lambda = \text{diag}(1, \ldots, 1, 0, \ldots, 0)\).

Transform to Standard Form: Let \(Z = Q^T\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)\). Then: \[\varepsilon^T P_{[X]^\perp}\varepsilon = Z^T\Lambda Z = \sum_{i=1}^{n-p} Z_i^2\]

Chi-Square Distribution: Since \(Z_i \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2)\): \[\frac{(n-p)\hat{\sigma}^2}{\sigma^2} = \sum_{i=1}^{n-p} \left(\frac{Z_i}{\sigma}\right)^2 \sim \chi^2_{n-p}\]

Moments from Chi-Square: Since \(\mathbb{E}[\chi^2_{n-p}] = n-p\) and \(\mathbb{V}(\chi^2_{n-p}) = 2(n-p)\):

  • \(\mathbb{E}[\hat{\sigma}^2] = \sigma^2\) (unbiased)
  • \(\mathbb{V}(\hat{\sigma}^2) = \frac{2\sigma^4}{n-p}\)

L² Consistency: Since \(\hat{\sigma}^2\) is unbiased:

\[\mathbb{E}[(\hat{\sigma}^2 - \sigma^2)^2] = \mathbb{V}(\hat{\sigma}^2) = \frac{2\sigma^4}{n-p} \to 0 \text{ as } n \to \infty\]

Therefore, \(\hat{\sigma}^2 \xrightarrow{L^2} \sigma^2\) as \(n \to \infty\).

Bonus: We also get the exact distribution: \(\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}\) for all \(n\).
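The chi-square claim can be checked empirically by simulating \(\varepsilon^T P_{[X]^\perp}\varepsilon/\sigma^2\) and comparing its first two moments with those of \(\chi^2_{n-p}\). A sketch under illustrative choices of \(n\), \(p\) and \(\sigma\):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 30, 4, 1.5                                 # illustrative values
X = rng.normal(size=(n, p))
P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Under Gaussian errors, eps^T P_perp eps / sigma^2 = (n-p) hat(sigma)^2 / sigma^2
E = sigma * rng.normal(size=(100_000, n))
samples = np.einsum('ij,jk,ik->i', E, P_perp, E) / sigma**2

print(samples.mean(), n - p)                             # chi^2_{n-p} mean:     n - p
print(samples.var(), 2 * (n - p))                        # chi^2_{n-p} variance: 2(n - p)
```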

Bounded fourth moment case (more technical)

Now we do not assume that \(\varepsilon\) is a Gaussian vector. We assume that the \(\varepsilon_i\) are iid and satisfy

  • \(\mathbb E[\varepsilon_i]=0\)
  • \(\mathbb V[\varepsilon_i] = \sigma^2\)
  • \(\mathbb E[\varepsilon_i^4]= \mu_4 < \infty\)

We write

\[\mathbb{V}(\hat{\sigma}^2) = \mathbb{E}[\hat{\sigma}^4] - (\mathbb{E}[\hat{\sigma}^2])^2 = \mathbb{E}[\hat{\sigma}^4] - \sigma^4 \enspace .\]

We have \[\mathbb{E}[\hat \sigma^4] = \frac{1}{(n-p)^2}\mathbb{E}[(\varepsilon^T P_{[X]^\perp}\varepsilon)^2] \enspace .\]

Hence, the standard formula for the variance of a quadratic form in iid centered errors gives

\[\mathbb V(\hat \sigma^2) = \frac{1}{(n-p)^2}\left(2\sigma^4\text{tr}(P_{[X]^\perp}^2) + (\mu_4 - 3\sigma^4)\sum_{i}(P_{[X]^\perp})_{ii}^2 \right) \enspace .\]

Using \(P_{[X]^\perp}^2=P_{[X]^\perp}\) and \(\text{tr}(P_{[X]^\perp}) = n-p\), this becomes

\[\mathbb V(\hat \sigma^2) = \frac{1}{(n-p)^2}\left(2\sigma^4(n-p) + (\mu_4 - 3\sigma^4)\sum_{i}(P_{[X]^\perp})_{ii}^2\right) \enspace .\]

Since \((P_{[X]^\perp})_{ii} = e_i^TP_{[X]^\perp}e_i = \|P_{[X]^\perp}e_i\|^2 \in [0,1]\) (the eigenvalues of an orthogonal projector lie in \(\{0,1\}\), so \(0 \preceq P_{[X]^\perp} \preceq I_n\)), we have \((P_{[X]^\perp})_{ii}^2 \leq (P_{[X]^\perp})_{ii}\), and hence \(\sum_{i}(P_{[X]^\perp})_{ii}^2 \leq \text{tr}(P_{[X]^\perp}) = n-p\).

Therefore:

\[\mathbb V(\hat \sigma^2) \leq \frac{1}{n-p}\left(2\sigma^4 + |\mu_4 - 3\sigma^4|\right) \to 0\]

when \(n \to \infty\).
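As an illustration of this non-Gaussian case, the sketch below uses centered \(\mathrm{Exp}(1)\) errors (so \(\sigma^2 = 1\) and \(\mu_4 = 9\), a choice made only for this example) and compares the Monte Carlo variance of \(\hat\sigma^2\) with the closed-form expression above.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 4                                             # illustrative values
X = rng.normal(size=(n, p))
P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Centered Exp(1) errors: sigma^2 = 1 and mu_4 = 9 (finite fourth moment, not Gaussian)
sigma2, mu4 = 1.0, 9.0
E = rng.exponential(scale=1.0, size=(200_000, n)) - 1.0
sigma2_hat = np.einsum('ij,jk,ik->i', E, P_perp, E) / (n - p)

theory = (2 * sigma2**2 * (n - p)
          + (mu4 - 3 * sigma2**2) * np.sum(np.diag(P_perp) ** 2)) / (n - p) ** 2
print(sigma2_hat.var(), theory)                          # Monte Carlo vs closed form
```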

Gaussian Case

We assume here that \(\varepsilon\) has distribution \(\mathcal N(0, \sigma^2 I_n)\).

Maximum Likelihood Estimator

MLE

Let \(\hat \beta_{MLE}\) and \(\hat \sigma_{MLE}^2\) be the MLE of \(\beta\) and \(\sigma^2\), respectively.

  • \(\hat{\beta}_{MLE} = \hat{\beta}\) and \(\hat{\sigma}^2_{MLE} = \frac{\text{RSS}}{n} = \frac{n-p}{n} \hat{\sigma}^2\), where \(\text{RSS} = \|Y - X\hat{\beta}\|^2\).

  • \(\hat{\beta} \sim N(\beta, \sigma^2(X^TX)^{-1})\).

  • \(\frac{n-p}{\sigma^2} \hat{\sigma}^2 = \frac{n}{\sigma^2} \hat{\sigma}^2_{MLE} \sim \chi^2(n - p)\).

  • \(\hat{\beta}\) and \(\hat{\sigma}^2\) are independent.

Setup

Model: \(Y = X\beta + \varepsilon\) where \(\varepsilon \sim N(0, \sigma^2 I_n)\)

This means: \(Y \sim N(X\beta, \sigma^2 I_n)\)

For \(n\) observations, the likelihood function is:

\[L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \|Y - X\beta\|^2\right)\]

\[\ell(\beta, \sigma^2) = \log L(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|Y - X\beta\|^2\]

Existence and Uniqueness of Global Maximum

Let \(\beta^*\) be the minimizer of \(\|Y-X\tilde \beta\|\) over all \(\tilde \beta \in \mathbb R^p\); by definition, \(\beta^*\) is the OLS estimator, and it satisfies \(\ell(\beta, \sigma^2) \leq -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|Y - X\beta^*\|^2 = \ell(\beta^*, \sigma^2)\) for every \(\beta\). Since \(p < n\), we almost surely have \(Y \notin [X]\), so that \(\|Y - X\beta^*\|^2 > 0\).

Now that \(\beta^*\) is fixed, the function \(\sigma^2 \mapsto -\frac{n}{2} \log(2\pi\sigma^2 ) - \frac{1}{2\sigma^2} \|Y - X\beta^*\|^2\) is differentiable on \((0, +\infty)\); its derivative is positive for \(\sigma^2 < \frac{1}{n}\|Y - X\beta^*\|^2\) and negative afterwards, so it achieves its unique maximum at \((\sigma^*)^2 = \frac{1}{n}\|Y - X\beta^*\|^2\).

Hence, we obtain that for any \(\beta\) and \(\sigma^2\), \(\ell(\beta, \sigma^2) \leq \ell(\beta^*, \sigma^2) \leq \ell(\beta^*, (\sigma^*)^2)\). We have just proved that \((\beta^*, (\sigma^*)^2)\) is the unique global maximizer, which implies that

\[ \begin{aligned} \hat \beta_{MLE}&= \beta^* = \hat \beta \\ \hat \sigma_{MLE}^2&= (\sigma^*)^2 = \frac{1}{n}\|Y-X\hat \beta\|^2 \end{aligned} \]
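One can cross-check this result by maximizing the log-likelihood numerically (here with scipy.optimize.minimize and a log parametrization of \(\sigma^2\), both arbitrary implementation choices) and comparing with the closed-form OLS quantities.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n, p = 60, 3                                   # illustrative values
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.8, size=n)

def neg_log_lik(theta):
    # theta = (beta, log sigma^2); the log parametrization keeps sigma^2 > 0
    beta, log_s2 = theta[:p], theta[p]
    r = Y - X @ beta
    return 0.5 * n * (np.log(2 * np.pi) + log_s2) + 0.5 * r @ r / np.exp(log_s2)

res = minimize(neg_log_lik, x0=np.zeros(p + 1), method="BFGS")

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
sigma2_mle = np.sum((Y - X @ beta_hat) ** 2) / n
print(np.round(res.x[:p], 4), np.round(beta_hat, 4))     # numerical MLE vs OLS beta
print(round(float(np.exp(res.x[p])), 4), round(sigma2_mle, 4))   # numerical vs RSS/n
```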

\(\hat \beta\) is an Efficient Estimator in the Gaussian Model

Theorem

In the Gaussian model, \(\hat \beta\) is an efficient estimator of \(\beta\). This means that \[ \Var(\hat \beta) \preceq \Var(\tilde \beta)\; , \] for any unbiased estimator \(\tilde \beta\).

Cramer-Rao Lower Bound

It holds that for any unbiased estimator \(\tilde \beta\), \[\mathbb V(\tilde \beta) \succeq [I(\beta)]^{-1}\] where \(I(\beta)\) is the Fisher information matrix. This is the Cramér-Rao lower bound. Hence, it is enough to prove that \(\mathbb V(\hat \beta) = [I(\beta)]^{-1}\).

Fisher Information Matrix

The log-likelihood function is: \[\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(Y - X\beta)^T(Y - X\beta)\]

First derivative with respect to \(\beta\): \[\frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2}X^T(Y - X\beta)\]

Second derivative: \[\frac{\partial^2 \ell}{\partial \beta \partial \beta^T} = -\frac{1}{\sigma^2}X^TX\]

Hence, since the Hessian is deterministic, \[I(\beta) = -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial \beta \partial \beta^T}\right] = \frac{1}{\sigma^2}X^TX \quad\text{and}\quad [I(\beta)]^{-1} = \sigma^2(X^TX)^{-1} = \mathbb V(\hat \beta) \enspace ,\]

which proves that \(\hat \beta\) is an efficient estimator of \(\beta\).
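A small numerical sketch of this computation: the score \(\frac{1}{\sigma^2}X^T(Y-X\beta)\) is differentiated by central finite differences (an illustrative check on arbitrary simulated data) and the resulting Hessian is compared with \(-\frac{1}{\sigma^2}X^TX\); the inverse Fisher information then matches \(\sigma^2(X^TX)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, sigma2 = 40, 3, 0.6                      # illustrative values
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)

def score(b):
    # gradient of the log-likelihood with respect to beta
    return X.T @ (Y - X @ b) / sigma2

# Hessian by central finite differences of the score: should equal -X^T X / sigma^2
h = 1e-5
H = np.zeros((p, p))
for j in range(p):
    e = np.zeros(p)
    e[j] = h
    H[:, j] = (score(beta + e) - score(beta - e)) / (2 * h)

print(np.allclose(H, -X.T @ X / sigma2, rtol=1e-4))      # True
print(np.round(sigma2 * np.linalg.inv(X.T @ X), 4))      # [I(beta)]^{-1} = V(beta_hat)
```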

Additional Notes

  • This efficiency holds specifically under the normality assumption
  • \(\hat \beta\) is also the Best Linear Unbiased Estimator (BLUE) by the Gauss-Markov theorem
  • Under normality, \(\hat \beta\) is the Best Unbiased Estimator (BUE): it has minimal variance among all unbiased estimators, not just the linear ones