Introduction to the Generalized Linear Model
\(\newcommand{\VS}{\quad \mathrm{VS} \quad}\) \(\newcommand{\and}{\quad \mathrm{and} \quad}\) \(\newcommand{\E}{\mathbb E}\) \(\newcommand{\P}{\mathbb P}\) \(\newcommand{\Var}{\mathbb V}\) \(\newcommand{\Cov}{\mathrm{Cov}}\) \(\newcommand{\1}{\mathbf 1}\)
Limits of the Linear Model
Recalling the Linear Model
. . .
We observe \(Y = (Y_1, \dots, Y_n)\) and \(X = (X^{(1)}, \dots, X^{(p)}) \in \mathbb R^{n \times p}\),
. . .
In the Linear Model, we assume that for some unknown \(\beta\) and \(\sigma^2\), with \(\varepsilon \sim \mathcal N(0, \sigma^2 I_n)\),
\[Y = X\beta + \varepsilon\]
. . .
The hypothesis can be written in the form \(\mathbb E[Y|X] = X\beta\)
. . .
The OLS estimator of \(\beta\) is \(\hat \beta= (X^TX)^{-1}X^TY\)
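As a quick illustration, the OLS formula can be applied directly with numpy (a minimal sketch on simulated data; solving the normal equations is numerically preferable to inverting \(X^TX\) explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                   # design matrix
beta = np.array([1.0, -2.0, 0.5])             # true coefficients
Y = X @ beta + rng.normal(scale=0.1, size=n)  # Y = X beta + epsilon

# OLS estimator: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```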
When this Linearity is Reasonable
. . .
The hypothesis \(\E(Y|X) = X^T\beta\) in linear regression models implies that \(\E(Y|X)\) can take any real value.
. . .
This is not a restriction when:
- \(Y|X\) follows a Gaussian distribution
- \(Y|X\) follows any other “nice” continuous distribution on \(\mathbb{R}\)
When it is not
. . .
The linear assumption is inappropriate for certain variables \(Y\), particularly when \(Y\) is qualitative or discrete.
When it is not
. . .
Binary outcomes (\(Y = 0\) or \(1\)):
- disease presence
. . .
Categorical outcomes (\(Y \in \{A_1, \ldots, A_k\}\)):
- Transportation choice: \(\{\) Car, Bus, Bike \(\}\)
. . .
Count data (\(Y \in \mathbb{N}\)):
- Number of traffic accidents per month
Objectives of the GLM
Key Differences by Response Type
. . .
In all examples, the objective remains to link \(Y\) to \(X = (X^{(1)}, \ldots, X^{(p)})\) through modeling \(\E(Y|X)\).
. . .
However, \(\E(Y|X)\) has different interpretations depending on the situation:
- Binary \(Y \in \{0, 1\}\): \(\E(Y|X) = \P(Y = 1|X) \in [0, 1]\)
- Categorical \(Y \in \{A_1, \ldots, A_k\}\): \(\E(Y|X)\) collects the \(k\) class probabilities
- Count data \(Y \in \mathbb{N}\): \(\E(Y|X) \geq 0\)
. . .
In all these cases, the linear model \(\E(Y|X) = X^T\beta\) is inappropriate.
Objectives of the GLM
. . .
We model \(\E(Y|X)\) differently using generalized linear models.
As in linear regression, we focus on:
- Specific effects: The individual effect of a given regressor (all other things being equal)
- Explanation: Understanding relationships
- Prediction: Forecasting outcomes
Modeling \(\E(Y|X)\): Three Fundamental Cases
Three Fundamental Cases
. . .
We detail the modeling challenges for \(\E(Y|X)\) in three fundamental cases:
Case 1: Binary: \(Y\) is binary (takes values 0 or 1)
Case 2: Categorical: \(Y \in \{A_1, \ldots, A_k\}\) (general qualitative variable)
Case 3: Count: \(Y \in \mathbb{N}\) (count variable)
Case 1: Binary Case
. . .
Without loss of generality, \(Y \in \{0, 1\}\)
. . .
If \(Y\) models membership in a category \(A\), this is equivalent to studying the variable \(Y = \mathbf{1}_A\)
. . .
The distribution of \(Y\) given \(X = x\) is entirely determined by \(p(x) = P(Y = 1|X = x)\)
. . .
We deduce \(P(Y = 0|X = x) = 1 - p(x)\)
. . .
\(Y|X = x\) follows a Bernoulli distribution with parameter \(p(x)\)
. . .
\(\E(Y|X = x) = p(x)\)
. . .
Key constraint: \(p(x) \in [0, 1]\)
Modelling \(p(x)\)
. . .
\[\E(Y|X = x) = \P(Y = 1|X = x) = p(x) \in [0, 1]\]
. . .
What NOT to do: \(p(x) = x^T\beta\) (for some \(\beta \in \mathbb{R}^p\) to be estimated), since \(x^T\beta\) can take any real value, not only values in \([0, 1]\)
. . .
Proposed approach: We can propose a model of the type:
\[p(x) = f(x^T\beta)\]
where \(f\) is a function from \(\mathbb{R}\) to \([0, 1]\)
. . .
Benefits: Coherent model that depends only on \(\beta\)
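A minimal sketch of this construction, taking \(f\) to be the logistic CDF (one possible choice, introduced formally later; the coefficients below are hypothetical):

```python
import numpy as np

def f(t):
    # Logistic CDF: maps any real t into (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

beta = np.array([0.5, -1.0])   # hypothetical coefficients
x = np.array([2.0, 1.0])       # one observation of the regressors
p_x = f(x @ beta)              # p(x) = f(x^T beta), always in [0, 1]
```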
Case 2: Categorical \(Y\)
. . .
If \(Y\) represents membership in \(k\) different classes \(A_1, \ldots, A_k\), its distribution is determined by the probabilities:
\[p_j(x) = P(Y \in A_j|X = x), \quad \text{for } j = 1, \ldots, k\]
. . .
Constraint: \(\sum_{j=1}^{k} p_j(x) = 1\) (If \(k = 2\), this reduces to the previous case)
. . .
Case 2: Categorical \(Y\)
. . .
Identifying \(Y\) with its vector of indicators \((\mathbf{1}_{A_1}, \ldots, \mathbf{1}_{A_k})\), it follows a multinomial distribution and:
\[\E(Y|X = x) = \begin{pmatrix} p_1(x) \\ \vdots \\ p_k(x) \end{pmatrix}\]
. . .
Case 2: Model for Categorical \(Y\)
. . .
To model \(\E(Y|X = x)\), it suffices to model \(p_1(x), \ldots, p_{k-1}(x)\) since \(p_k(x) = 1 - \sum_{j=1}^{k-1} p_j(x)\)
. . .
Proposed model: As in the binary case, we can propose:
\[p_j(x) = f(x^T\beta_j), \quad j = 1, \ldots, k-1\]
where \(f: \mathbb{R} \to [0,1]\)
. . .
Parameters: There will be \(k-1\) unknown parameter vectors \(\beta_1, \ldots, \beta_{k-1}\) to estimate, each in \(\mathbb{R}^p\)
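One standard way to make the \(p_j\) coherent (non-negative and summing to one) is the multinomial logit construction, sketched below with hypothetical coefficient vectors and \(k = 3\) classes:

```python
import numpy as np

def class_probs(x, betas):
    # betas has one row beta_j per class j = 1, ..., k-1
    # p_j(x) = exp(x^T beta_j) / (1 + sum_l exp(x^T beta_l))
    scores = np.exp(betas @ x)
    p = scores / (1.0 + scores.sum())
    # p_k(x) = 1 - sum of the others, closing the constraint
    return np.append(p, 1.0 - p.sum())

betas = np.array([[0.5, -1.0],
                  [1.0,  0.2]])  # hypothetical beta_1, beta_2 (k = 3)
probs = class_probs(np.array([1.0, 2.0]), betas)
```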
Case 3: Count Y - Non-negative Integer Values
. . .
If \(Y\) takes non-negative integer values, we have for all \(x\), \(\E(Y|X = x) \geq 0\)
. . .
Coherent choice: A coherent approach is:
\[\E(Y|X = x) = f(x^T\beta)\]
where \(f\) is a function from \(\mathbb{R}\) to \([0, +\infty)\)
. . .
Example of a possible choice for \(f\): the exponential function
Model Formulation
. . .
Let \(g\) be a strictly monotonic function, called the link function
. . .
A generalized linear model (GLM) establishes a relationship of the type:
\[g(\E(Y|X = x)) = x^T\beta\]
Equivalently,
\[\E(Y|X = x) = g^{-1}(x^T\beta)\]
. . .
Remarks on Model
- In the previous examples, \(g^{-1}\) was denoted \(f\)
- We generally assume that the distribution \(Y|X\) belongs to the exponential family, with unknown parameter \(\beta\)
- This allows us to compute the likelihood and facilitates inference
General Objectives
. . .
In a GLM, the goal is to estimate \(\beta \in \mathbb{R}^p\)
. . .
Using \(n\) independent observations of \((Y, X)\), we use maximum likelihood estimation (the distribution of \(Y|X\) being known up to \(\beta\))
. . .
The link function \(g\) is not to be estimated: we choose it according to the nature of the data.
. . .
Inference and diagnostic tools are available (as in linear regression)
Remark on the Intercept
. . .
Among the explanatory variables \(X^{(1)}, \ldots, X^{(p)}\), we often assume that \(X^{(1)}=\1\) to account for the presence of a constant. Thus: \[X\beta = \beta_1 X^{(1)} + \cdots + \beta_p X^{(p)}\]
. . .
Alternative notation: the intercept is sometimes written separately, giving \(\beta_0 + \beta_1 X^{(1)} + \cdots + \beta_p X^{(p)}\)
Example 1: Linear Regression Model
. . .
Link function: We recover linear regression by taking the identity link function \(g(t) = t\)
. . .
Expected value: Then:
\[\E(Y|X = x) = g^{-1}(x^T\beta) = x^T\beta\]
. . .
In the Gaussian linear model:
\[Y|X \sim \mathcal{N}(X\beta, \sigma^2 I_n)\]
. . .
Linear regression is therefore a special case of GLM models!
Example 2: Binary Case \(Y \in \{0, 1\}\)
. . .
Link function requirement: The link function \(g\) must satisfy:
\[\E(Y|X = x) = g^{-1}(x^T\beta) \in [0, 1]\]
. . .
Since \(Y\in \{0,1\}\), \(Y|X\) follows a Bernoulli distribution
\[Y|X \sim \mathcal{B}(g^{-1}(X^T\beta))\]
Example 2: Binary Case \(Y \in \{0, 1\}\)
. . .
\[Y|X \sim \mathcal{B}(g^{-1}(X^T\beta))\]
Possible choices for \(g^{-1}\): the CDF of a continuous distribution on \(\mathbb{R}\)
. . .
Standard choice for \(g^{-1}\): The CDF of a logistic distribution:
\[g^{-1}(t) = \frac{e^t}{1 + e^t} \quad \text{i.e.} \quad g(t) = \ln\left(\frac{t}{1-t}\right) = \text{logit}(t)\]
. . .
This leads to the logistic model, the most important model in this chapter
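A sketch of maximum-likelihood estimation in this logistic model on simulated data, using Newton–Raphson (the standard fitting algorithm for GLMs; all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.5])
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))  # p(x) = g^{-1}(x^T beta)
Y = rng.binomial(1, prob)                      # Y|X ~ Bernoulli(p(x))

# Newton-Raphson on the log-likelihood
beta = np.zeros(p)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))     # current fitted probabilities
    grad = X.T @ (Y - mu)                      # score vector
    W = mu * (1.0 - mu)
    hess = X.T @ (X * W[:, None])              # Fisher information
    beta = beta + np.linalg.solve(hess, grad)
```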
Example 3: Count Data \(Y\in \mathbb N\)
. . .
Link function: \(g(t) = \ln(t)\), \(g^{-1}(t) = e^t\) gives:
\[\E(Y|X = x) = g^{-1}(x^T\beta) = e^{x^T\beta}\]
. . .
For the distribution of \(Y|X\), defined on \(\mathbb{N}\), we often assume it follows a Poisson distribution (exponential family)
. . .
In this context:
\[Y|X \sim \mathcal{P}(e^{X^T\beta})\]
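The same machinery applies to the Poisson model with log link; a short sketch on simulated data (illustrative values, Newton–Raphson as in the logistic case):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
beta_true = np.array([0.5, 0.8])
lam = np.exp(X @ beta_true)   # E(Y|X) = e^{x^T beta}
Y = rng.poisson(lam)          # Y|X ~ Poisson(e^{x^T beta})

# Newton-Raphson for the Poisson log-likelihood
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)                      # current fitted means
    grad = X.T @ (Y - mu)                      # score vector
    hess = X.T @ (X * mu[:, None])             # Fisher information
    beta = beta + np.linalg.solve(hess, grad)
```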
Summary
There are two choices to make when setting up a GLM:
- The distribution of \(Y|X\)
- The link function \(g\) defining \(\E(Y|X) = g^{-1}(X^T\beta)\)
. . .
Key insight: The second choice is linked to the first
The 3 Common Cases
. . .
Binary (\(Y \in \{0, 1\}\)):
→ \(Y|X\): Bernoulli distribution
→ By default \(g = \text{logit}\) (see later)
. . .
Multi-category (\(Y \in \{A_1, \ldots, A_k\}\)):
→ \(Y|X\): multinomial distribution
→ By default \(g = \text{logit}\)
. . .
Count (\(Y \in \mathbb{N}\)):
→ \(Y|X\): Poisson (often) or negative binomial
→ Choice of \(g\): by default \(g = \ln\)