\(\newcommand{\VS}{\quad \mathrm{VS} \quad}\) \(\newcommand{\and}{\quad \mathrm{and} \quad}\) \(\newcommand{\E}{\mathbb E}\) \(\newcommand{\P}{\mathbb P}\) \(\newcommand{\Var}{\mathbb V}\) \(\newcommand{\Cov}{\mathrm{Cov}}\) \(\newcommand{\1}{\mathbf 1}\)
We observe \(Y = (Y_1, \dots, Y_n)\) and \(X = (X^{(1)}, \dots, X^{(p)}) \in \mathbb R^{n \times p}\).
In the linear model, we assume that for some unknown \(\beta\)
\[Y = X\beta + \varepsilon\]
Assuming \(\E[\varepsilon|X] = 0\), the hypothesis can be written in the form \(\E[Y|X] = X\beta\)
The OLS estimator of \(\beta\) is \(\hat \beta = (X^TX)^{-1}X^TY\).
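As an illustration, here is a minimal NumPy sketch (simulated data; names like `beta_true` are hypothetical) computing \(\hat\beta\) via the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = intercept
beta_true = np.array([1.0, 2.0, -0.5])                          # hypothetical true coefficients
Y = X @ beta_true + rng.normal(size=n)                          # Y = X beta + noise

# OLS: solve (X^T X) beta_hat = X^T Y rather than forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # close to beta_true
```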
The hypothesis \(\E(Y|X) = X^T\beta\) in linear regression models implies that \(\E(Y|X)\) can take any real value.
This is not a restriction when \(Y\) is quantitative and can take any real value.
The linear assumption is inappropriate for certain variables \(Y\), particularly when \(Y\) is qualitative or discrete.
Binary outcomes (\(Y = 0\) or \(1\)): e.g., presence or absence of a disease
Categorical outcomes (\(Y \in \{A_1, \ldots, A_k\}\)): e.g., a choice among \(k\) categories
Count data (\(Y \in \mathbb{N}\)): e.g., a number of events observed over a period
In all examples, the objective remains to link \(Y\) to \(X = (X^{(1)}, \ldots, X^{(p)})\) through modeling \(\E(Y|X)\).
However, \(\E(Y|X)\) has different interpretations depending on the situation: a probability \(P(Y = 1|X)\) in the binary case, a vector of class probabilities in the categorical case, and a nonnegative mean in the count case.
In all these cases, the linear model \(\E(Y|X) = X^T\beta\) is inappropriate.
We model \(\E(Y|X)\) differently using generalized linear models.
As in linear regression, we focus on estimating \(\beta\), on inference (tests, confidence intervals), and on prediction.
We detail the modeling challenges for \(\E(Y|X)\) in three fundamental cases:
Case 1: Binary: \(Y\) is binary (takes values 0 or 1)
Case 2: Categorical: \(Y \in \{A_1, \ldots, A_k\}\) (general qualitative variable)
Case 3: Count: \(Y \in \mathbb{N}\) (count variable)
Without loss of generality, \(Y \in \{0, 1\}\)
If \(Y\) models membership in a category \(A\), this is equivalent to studying the variable \(Y = \mathbf{1}_A\)
The distribution of \(Y\) given \(X = x\) is entirely determined by \(p(x) = P(Y = 1|X = x)\)
We deduce \(P(Y = 0|X = x) = 1 - p(x)\)
\(Y|X = x\) follows a Bernoulli distribution with parameter \(p(x)\)
\(\E(Y|X = x) = p(x)\)
Key constraint: \(p(x) \in [0, 1]\)
\[\E(Y|X = x) = P(Y = 1|X = x) = p(x) \in [0, 1]\]
What NOT to do: \(p(x) = x^T\beta\) (for some \(\beta \in \mathbb{R}^p\) to be estimated), since \(x^T\beta\) is not constrained to lie in \([0, 1]\)
Proposed approach: We can propose a model of the type:
\[p(x) = f(x^T\beta)\]
where \(f\) is a function from \(\mathbb{R}\) to \([0, 1]\)
Benefits: A coherent model (\(p(x) \in [0, 1]\) by construction) that depends only on \(\beta\)
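As a minimal sketch (taking \(f\) to be the logistic function, the standard choice introduced later), this shows how \(f\) maps any value of \(x^T\beta\) into a valid probability:

```python
import numpy as np

def f(t):
    """Logistic function: maps any real t into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x = rng.normal(size=5)      # hypothetical covariate vector
beta = rng.normal(size=5)   # hypothetical coefficients
print(x @ beta)             # can be any real number
print(f(x @ beta))          # always in (0, 1): a valid probability
```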
If \(Y\) represents membership in \(k\) different classes \(A_1, \ldots, A_k\), its distribution is determined by the probabilities:
\[p_j(x) = P(Y \in A_j|X = x), \quad \text{for } j = 1, \ldots, k\]
Constraint: \(\sum_{j=1}^{k} p_j(x) = 1\) (If \(k = 2\), this reduces to the previous case)
\(Y = (\mathbf{1}_{A_1}, \ldots, \mathbf{1}_{A_k})\) follows a multinomial distribution and:
\[\E(Y|X = x) = \begin{pmatrix} p_1(x) \\ \vdots \\ p_k(x) \end{pmatrix}\]
To model \(\E(Y|X = x)\), it suffices to model \(p_1(x), \ldots, p_{k-1}(x)\) since \(p_k(x) = 1 - \sum_{j=1}^{k-1} p_j(x)\)
Proposed model: As in the binary case, we can propose:
\[p_j(x) = f(x^T\beta_j), \quad j = 1, \ldots, k-1\]
where \(f: \mathbb{R} \to [0,1]\)
Parameters: There will be \(k-1\) unknown parameters to estimate, each in \(\mathbb{R}^p\)
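One standard way to make this construction concrete is the multinomial logit (a sketch, with class \(k\) as the reference), which produces \(k\) probabilities summing to 1 from \(k-1\) parameter vectors:

```python
import numpy as np

def class_probs(x, betas):
    """Multinomial logit: betas has shape (k-1, p); class k is the reference.

    p_j(x) = exp(x^T beta_j) / (1 + sum_l exp(x^T beta_l)), j = 1, ..., k-1
    p_k(x) = 1 / (1 + sum_l exp(x^T beta_l))
    """
    scores = betas @ x                  # (k-1,) vector of x^T beta_j
    exps = np.exp(scores)
    denom = 1.0 + exps.sum()
    return np.append(exps / denom, 1.0 / denom)

rng = np.random.default_rng(0)
k, p = 4, 3
betas = rng.normal(size=(k - 1, p))     # hypothetical parameters
x = rng.normal(size=p)
probs = class_probs(x, betas)
print(probs, probs.sum())               # k probabilities summing to 1
```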
If \(Y\) takes values in \(\mathbb{N}\), we have for all \(x\), \(\E(Y|X = x) \geq 0\)
Coherent choice: A coherent approach is:
\[\E(Y|X = x) = f(x^T\beta)\]
where \(f\) is a function from \(\mathbb{R}\) to \([0, +\infty)\)
Example of a possible choice for \(f\): the exponential function \(f(t) = e^t\)
Let \(g\) be a strictly monotonic function, called the link function
A generalized linear model (GLM) establishes a relationship of the type:
\[g(\E(Y|X = x)) = x^T\beta\]
Equivalently,
\[\E(Y|X = x) = g^{-1}(x^T\beta)\]
In a GLM model, the goal is to estimate \(\beta \in \mathbb{R}^p\)
Using \(n\) independent observations of \((Y, X)\), we estimate \(\beta\) by maximum likelihood (the distribution of \(Y|X\) being known up to \(\beta\))
The link function \(g\) is not to be estimated: we choose it according to the nature of the data.
Inference and diagnostic tools are available (as in linear regression)
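A quick numerical sketch of the link/inverse-link relation, using SciPy's `logit` and `expit` as an example pair \((g, g^{-1})\):

```python
import numpy as np
from scipy.special import expit, logit  # expit plays g^{-1}, logit plays g

t = np.linspace(-3, 3, 7)          # values playing the role of x^T beta
mu = expit(t)                      # E(Y|X=x) = g^{-1}(x^T beta), always in (0, 1)
print(np.allclose(logit(mu), t))   # g(E(Y|X=x)) recovers x^T beta: True
```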
Among the explanatory variables \(X^{(1)}, \ldots, X^{(p)}\), we often assume that \(X^{(1)}=\1\) to account for the presence of a constant. Thus: \[X\beta = \beta_1 X^{(1)} + \cdots + \beta_p X^{(p)} = \beta_1 + \beta_2 X^{(2)} + \cdots + \beta_p X^{(p)}\]
Alternative notation: The coefficients are sometimes indexed from 0, writing \(\beta_0 + \beta_1 X^{(1)} + \cdots + \beta_p X^{(p)}\)
Link function: We recover linear regression by taking the identity link function \(g(t) = t\)
Expected value: Then:
\[\E(Y|X = x) = g^{-1}(x^T\beta) = x^T\beta\]
In the Gaussian linear model:
\[Y|X \sim \mathcal{N}(X\beta, \sigma^2 I_n)\]
Linear regression is therefore a special case of GLM models!
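To illustrate (a sketch with `statsmodels` on simulated data), fitting a Gaussian GLM with its default identity link recovers the OLS coefficients exactly:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))   # intercept + 2 covariates
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

ols = sm.OLS(Y, X).fit()
glm = sm.GLM(Y, X, family=sm.families.Gaussian()).fit()  # identity link by default
print(np.allclose(ols.params, glm.params))               # True: same estimator
```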
Link function requirement: The link function \(g\) must satisfy:
\[\E(Y|X = x) = g^{-1}(x^T\beta) \in [0, 1]\]
Since \(Y\in \{0,1\}\), \(Y|X\) follows a Bernoulli distribution
\[Y|X \sim \mathcal{B}(g^{-1}(X^T\beta))\]
Possible choices for \(g^{-1}\): the CDF of a continuous distribution on \(\mathbb{R}\) (e.g., the Gaussian CDF \(\Phi\), which gives the probit model)
Standard choice for \(g^{-1}\): The CDF of a logistic distribution:
\[g^{-1}(t) = \frac{e^t}{1 + e^t} \quad \text{i.e.} \quad g(t) = \ln\left(\frac{t}{1-t}\right) = \text{logit}(t)\]
This leads to the logistic model, the most important model in this chapter
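A minimal sketch of fitting the logistic model by maximum likelihood with `statsmodels` (simulated data, hypothetical coefficients):

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(0)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))
beta_true = np.array([-0.5, 1.0, 2.0])     # hypothetical true coefficients
p = expit(X @ beta_true)                   # p(x) = g^{-1}(x^T beta), logit link
Y = rng.binomial(1, p)                     # Y|X ~ Bernoulli(p(x))

# The Binomial family uses the logit link by default
fit = sm.GLM(Y, X, family=sm.families.Binomial()).fit()
print(fit.params)                          # maximum likelihood estimate of beta
```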
Link function: For count data, \(g(t) = \ln(t)\), with \(g^{-1}(t) = e^t\), gives:
\[\E(Y|X = x) = g^{-1}(x^T\beta) = e^{x^T\beta}\]
For the distribution of \(Y|X\), supported on \(\mathbb{N}\), we often assume a Poisson distribution (a member of the exponential family)
In this context:
\[Y|X \sim \mathcal{P}(e^{X^T\beta})\]
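An analogous sketch for Poisson regression (simulated data; the Poisson family uses the log link by default):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))
beta_true = np.array([0.5, 0.8, -0.3])     # hypothetical true coefficients
mu = np.exp(X @ beta_true)                 # E(Y|X=x) = exp(x^T beta) >= 0
Y = rng.poisson(mu)                        # Y|X ~ Poisson(exp(x^T beta))

fit = sm.GLM(Y, X, family=sm.families.Poisson()).fit()  # log link by default
print(fit.params)                          # maximum likelihood estimate of beta
```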
There are two choices to make when setting up a GLM: the distribution of \(Y|X\) and the link function \(g\)
Key insight: The second choice is linked to the first: each distribution comes with a natural default link
Binary (\(Y \in \{0, 1\}\)):
→ \(Y|X\): Bernoulli distribution
→ By default \(g = \text{logit}\) (see above)
Multi-category (\(Y \in \{A_1, \ldots, A_k\}\)):
→ \(Y|X\): multinomial distribution
→ By default \(g = \text{logit}\)
Count (\(Y \in \mathbb{N}\)):
→ \(Y|X\): Poisson (often) or negative binomial
→ Choice of \(g\): by default \(g = \ln\)
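These defaults can be checked directly against the `statsmodels` GLM families (a small sketch; the multinomial case is handled by separate tools such as `sm.MNLogit`):

```python
import statsmodels.api as sm

# Each GLM family carries a default link matching the summary above
for family in (sm.families.Binomial(),          # binary -> logit
               sm.families.Poisson(),           # count  -> log
               sm.families.NegativeBinomial()): # count  -> log
    print(type(family).__name__, "->", type(family.link).__name__)
```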