We observe \(n\) individuals, and variables \(Y \in \mathbb R^n\) and \((X^{(1)}, \dots, X^{(p)}) \in \mathbb R^{n \times p}\).
In other words, we observe
\(Y= (Y_1, \dots, Y_n) \in \mathbb R^n\)
\(X^{(1)} = (X^{(1)}_1, \dots, X^{(1)}_n) \in \mathbb R^n\)
\(X^{(2)} = (X^{(2)}_1, \dots, X^{(2)}_n) \in \mathbb R^n\)
…
\(X^{(p)} = (X^{(p)}_1, \dots, X^{(p)}_n)\in \mathbb R^n\)
We assume that
\[Y_i = F(X^{(1)}_i, X^{(2)}_i, \dots, X^{(p)}_i, \varepsilon_i)\]
where \(\varepsilon = (\varepsilon_1, \dots, \varepsilon_n) \in \mathbb R^n\) is a vector of iid random noise terms
\(\varepsilon\) is not observed
\(F\) is unknown
→ Too ambitious: with \(F\) left completely unspecified, there is a risk of overfitting
We assume that
\[Y = \beta_1 X^{(1)}+ \beta_2 X^{(2)}+ \dots+ \beta_p X^{(p)}+ \varepsilon\]
That is, we assume that \(F\) is of the form \(F(x_1, \dots, x_p, \varepsilon) = \beta_1 x_1+ \beta_2 x_2+ \dots+ \beta_p x_p+ \varepsilon\)
\(Y\) and the \(X^{(k)}\)’s are vectors in \(\mathbb R^n\).
For all \(i\),
\[Y_i = \beta_1 X^{(1)}_i+ \beta_2 X^{(2)}_i+ \dots+ \beta_p X^{(p)}_i+ \varepsilon_i\]
If we set \(X^{(1)}= (1, \dots, 1)\), then the model rewrites
\[Y_i = \beta_1+ \beta_2 X^{(2)}_i+ \dots+ \beta_p X^{(p)}_i+ \varepsilon_i\]
We write \(Y = (Y_1, \dots, Y_n)\) and \(X^{(k)} = (X^{(k)}_1, \dots, X^{(k)}_n)\) as columns:
\(\newcommand{\VS}{\quad \mathrm{VS} \quad}\) \(\newcommand{\and}{\quad \mathrm{and} \quad}\) \(\newcommand{\E}{\mathbb E}\) \(\newcommand{\P}{\mathbb P}\) \(\newcommand{\Var}{\mathbb V}\)
\[Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \and X^{(k)}=\begin{pmatrix} X^{(k)}_1 \\ \vdots \\ X^{(k)}_n \end{pmatrix}\]
To get a matrix form, we write \(X_{ik} = X^{(k)}_i\). Then:
\[Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \and X^{(k)}=\begin{pmatrix} X_{1,k} \\ \vdots \\ X_{n,k} \end{pmatrix}\]
To get a matrix form, we write \(X\) for the matrix \((X_{ik}) \in \mathbb R^{n \times p}\)
That is, \(X = (X^{(1)}, \dots, X^{(p)})\)
And:
\[Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \and X=\begin{pmatrix} X_{1,1} &\dots &X_{1,p} \\ \vdots & &\vdots \\ X_{n,1} &\dots &X_{n,p} \end{pmatrix}\]
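The construction \(X = (X^{(1)}, \dots, X^{(p)})\) can be sketched in NumPy. This is only an illustration: the sample size and the column values below are made up, and `X1`, `X2`, `X3` are hypothetical names for \(X^{(1)}, X^{(2)}, X^{(3)}\).

```python
import numpy as np

n = 5  # hypothetical number of individuals

# Hypothetical columns: X1 is the constant column, X2 and X3 are made-up covariates.
X1 = np.ones(n)
X2 = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
X3 = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Stack the p = 3 column vectors side by side.
# With 0-based indexing, X[i - 1, k - 1] holds X_{i,k} = X^{(k)}_i.
X = np.column_stack([X1, X2, X3])

print(X.shape)  # (5, 3), i.e. (n, p)
```

Row \(i\) of `X` collects the \(p\) variables of individual \(i\), and column \(k\) is the vector \(X^{(k)}\).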
Let \(\beta = (\beta_1, \dots, \beta_p) \in \mathbb R^p\) be unknown parameters, and \(\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)\) be iid noise.
In column notation:
\[\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} \and \varepsilon=\begin{pmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{pmatrix}\]
We observe \(Y = (Y_1, \dots, Y_n) \in \mathbb R^n\) and \(X \in \mathbb R^{n \times p}\)
We assume that
\[Y = X \beta + \varepsilon\]
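A minimal sketch of the matrix form, checking that \(Y = X\beta + \varepsilon\) agrees with the row-by-row equation \(Y_i = \beta_1 X_{i,1} + \dots + \beta_p X_{i,p} + \varepsilon_i\). Assumptions not in the slides: the values of `beta` and `sigma`, the random design, and Gaussian noise (the model only requires iid noise).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Design matrix: constant first column, then p - 1 simulated covariates.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

beta = np.array([1.0, 2.0, -0.5])      # hypothetical true parameters
sigma = 0.3                            # hypothetical noise standard deviation
eps = rng.normal(scale=sigma, size=n)  # iid noise, Gaussian by assumption

# Matrix form of the model.
Y = X @ beta + eps

# Same model written individual by individual.
Y_rowwise = np.array(
    [sum(beta[k] * X[i, k] for k in range(p)) + eps[i] for i in range(n)]
)

print(np.allclose(Y, Y_rowwise))
```

In practice only `Y` and `X` are observed; `beta` and `eps` are available here because the data are simulated.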
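where \(\beta \in \mathbb R^p\) is unknown and \(\varepsilon\) is iid noise with \(\mathbb V(\varepsilon)=\sigma^2 I_n\).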
Interactive figure: the true (blue) plane \(Y=c+a\,X^{(1)}+b\,X^{(2)}\) is set by the coefficients \(a,b,c\), and the data follow it up to noise. The estimated (green) plane is computed from the data by OLS; the gap between the blue and green planes is the estimation error.
The model assumes \(\mathbb V(\varepsilon_i)=\sigma^2\) for every \(i\) (homoscedasticity: the noise band keeps a constant width). If instead the variance grows with \(x\) (heteroscedasticity: the band becomes a funnel), then \(\mathbb V(\varepsilon)=\sigma^2 I_n\) no longer holds.
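The difference between constant and growing noise variance can be sketched by simulation. Everything numeric here is an assumption made for illustration: the range of `x`, Gaussian noise, and the particular choice of a standard deviation proportional to `x` in the heteroscedastic case.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(1.0, 10.0, size=n)  # hypothetical covariate
sigma = 1.0

# Homoscedastic: the standard deviation is the same for every individual.
eps_homo = rng.normal(scale=sigma, size=n)

# Heteroscedastic: the standard deviation grows with x (hypothetical choice).
eps_hetero = rng.normal(scale=sigma * x / 5.0, size=n)

# Compare the empirical noise variance for large x against small x.
low, high = x < 5.5, x >= 5.5
ratio_homo = eps_homo[high].var() / eps_homo[low].var()
ratio_hetero = eps_hetero[high].var() / eps_hetero[low].var()

print(ratio_homo, ratio_hetero)
```

A ratio close to 1 is what a band of constant width looks like; a ratio well above 1 corresponds to the funnel shape.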
Recall that \(X \in \mathbb R^{n \times p}\)
We assume that \(\mathrm{rk}(X)=p\).
This implies \(p \leq n\)
If this condition is not satisfied:
There is a linear relation between the \(X^{(k)}\)'s.
That is, \(X\alpha=\alpha_1X^{(1)} + \dots + \alpha_p X^{(p)}=0\) for some \(\alpha \in \mathbb R^p\setminus \{0\}\)
Then \(\beta\) is not uniquely determined: infinitely many parameter vectors give the same \(X\beta\), since for all \(t \in \mathbb R\),
\[ X(\beta + t\alpha) = X\beta + t\,X\alpha = X\beta \]
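The relation \(X(\beta + t\alpha) = X\beta\) can be checked numerically. In this sketch the third column is deliberately built as a linear combination of the first two, so \(\mathrm{rk}(X) < p\); the coefficients, `beta`, and the values of `t` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

X1 = np.ones(n)
X2 = rng.normal(size=n)
X3 = 2.0 * X1 - 3.0 * X2  # exact linear relation between the columns
X = np.column_stack([X1, X2, X3])

rank = np.linalg.matrix_rank(X)
print(rank)  # 2, which is less than p = 3

# alpha encodes the relation 2*X1 - 3*X2 - X3 = 0, so X @ alpha = 0.
alpha = np.array([2.0, -3.0, -1.0])
beta = np.array([1.0, 0.5, -2.0])  # hypothetical parameter vector

# Moving beta along alpha does not change X @ beta.
same_fit = [np.allclose(X @ (beta + t * alpha), X @ beta) for t in (-1.0, 0.0, 4.0)]
print(same_fit)
```

Since every `beta + t * alpha` produces the same \(X\beta\), the data cannot tell these parameter vectors apart.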
Interactive figure: map of the least-squares cost \(L(\beta)=\lVert Y-X\beta\rVert^2\) over \((\beta_1,\beta_2)\). With \(\mathrm{rk}(X)=2\) it has a single minimum \(\hat\beta\) (gold). As \(X^{(2)}\to X^{(1)}\), the basin stretches into a flat valley: infinitely many \(\beta\) give the same fit, since \(X(\beta+t\alpha)=X\beta\), and \(\cos(X^{(1)},X^{(2)})\to 1\) (the equality case in Cauchy–Schwarz).
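The flattening of the cost can be sketched with two nearly collinear columns. Assumptions for illustration only: the simulated data, `beta_true`, the values of `delta`, and the use of `np.linalg.lstsq` to compute the minimiser \(\hat\beta\). The direction \(\alpha = (1, -1)\) is chosen because \(X\alpha = X^{(1)} - X^{(2)}\) shrinks as \(X^{(2)} \to X^{(1)}\).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X1 = rng.normal(size=n)
noise = rng.normal(size=n)
beta_true = np.array([1.0, -2.0])  # hypothetical true parameters
alpha = np.array([1.0, -1.0])      # direction along which the cost flattens


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def cost(X, Y, beta):
    """Least-squares cost L(beta) = ||Y - X beta||^2."""
    r = Y - X @ beta
    return float(r @ r)


results = []
for delta in (1.0, 0.1, 0.001):
    X2 = X1 + delta * noise  # X2 approaches X1 as delta shrinks
    X = np.column_stack([X1, X2])
    Y = X @ beta_true + 0.1 * rng.normal(size=n)

    # Minimiser of the cost (unique here because rk(X) = 2).
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # How much the cost rises when moving one step along alpha.
    increase = cost(X, Y, beta_hat + alpha) - cost(X, Y, beta_hat)
    results.append((delta, cosine(X1, X2), increase))

for delta, c, inc in results:
    print(f"delta={delta}: cos={c:.6f}, cost increase along alpha={inc:.6g}")
```

As `delta` decreases, the cosine approaches 1 and the cost increase along `alpha` approaches 0, which is the flat valley of the figure.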