Goodness of Fit and Homogeneity Tests

Outline

  • Multinomial Distributions
  • Goodness of Fit
    • Chi-squared goodness of fit test
    • Histograms
    • Chi-squared goodness of fit to compare distributions
    • Using CDFs and KS test


Multinomial Distribution

Binomial distribution

Draw \(n\) balls, blue or red, with replacement

\(p_1\), \(1-p_1\): proportions of blue/red

\(X\), \(Y\): counts of blue/red

Then:

\(X \sim \mathrm{Bin}(n,p_1)\), \(Y=n-X \sim \mathrm{Bin}(n,1-p_1)\)

If \(k_1 +k_2 = n\):

\(\mathbb P((X,Y) = (k_1,k_2)) = \binom{n}{k_1}p_1^{k_1}(1-p_1)^{k_2}\)

Multinomial distribution

Draw \(n\) balls, \(m\) potential colors, with replacement

\((p_1, \dots, p_m)\): proportions of each color: \(\sum_{i=1}^m p_i = 1\)

\(X_1, \dots, X_m\): counts of each color

Then:

\((X_1, \dots, X_m) \sim \mathrm{Mult}(n,(p_1, \dots, p_m))\)

Formula

If \(k_1 + \dots + k_m = n\), then, writing \(\binom{n}{k_1,\dots, k_m} = \dfrac{n!}{k_1!\dots k_m!}\) for the multinomial coefficient:

\(\mathbb P((X_1, \dots, X_m)=(k_1, \dots, k_m)) = \dfrac{n!}{k_1!\dots k_m!}p_1^{k_1} \dots p_m^{k_m}\)

Proof of the Formula

Step 1: probability of one ordered sequence with counts \((k_1, \dots, k_m)\)

\[\underbrace{1, \dots, 1}_{k_1}, \underbrace{2, \dots, 2}_{k_2}, \dots, \underbrace{m, \dots, m}_{k_m} \quad \longrightarrow \quad p_1^{k_1} \cdots p_m^{k_m}\]

Step 2: number of such sequences

Permutations of \(n\) objects with \(k_i\) identical of type \(i\):

\[\frac{n!}{k_1! \dots k_m!}\]

Example

Take \(n\) rolls of a die. The face counts follow a \(\mathrm{Mult}(n, (1/6, \dots, 1/6))\).

Ask \(n\) people their favorite color among \(k\) options: \(\text{Mult}(n, (p_1, \ldots, p_k))\).

Generate \(n\) words from a vocabulary of size \(V\): \(\text{Mult}(n, (p_1, \ldots, p_V))\).

Exercise

Take \((X_1, \dots, X_m) \sim \mathrm{Mult}(3,(1/4, 1/2, 1/4))\).

What is the probability of observing \((2, 1, 0)\)?

\(\mathbb P((X_1,X_2,X_3)=(2,1,0)) = \frac{3!}{2!\,1!\,0!}\left(\frac{1}{4}\right)^2\left(\frac{1}{2}\right)^1\left(\frac{1}{4}\right)^0 \\= 3 \cdot \frac{1}{16} \cdot \frac{1}{2} = \frac{3}{32}\)
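This computation can be checked numerically; a minimal sketch, assuming SciPy is available:

```python
from scipy.stats import multinomial

# P((X1, X2, X3) = (2, 1, 0)) under Mult(3, (1/4, 1/2, 1/4))
prob = multinomial.pmf([2, 1, 0], n=3, p=[0.25, 0.5, 0.25])
print(prob)  # 3/32 = 0.09375
```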

\(\chi^2\) Goodness of Fit

\(\chi^2\) Goodness of Fit Problem

We observe \((X_1, \dots, X_m) \sim \mathrm{Mult}(n, q)\).

This corresponds to \(n\) counts: \(X_1 + \dots + X_m = n\)

\(q = (q_1, \dots, q_m)\) corresponds to probabilities of getting color \(1, \dots, m\)

Let \(p = (p_1, \dots, p_m)\) be a known vector s.t. \(p_1 + \dots + p_m = 1\).

\(H_0:~ q = p \quad \text{VS} \quad H_1: q \neq p \; .\)

\(\chi^2\) Goodness of Fit Test (Adéquation)

Chi-squared test statistic:

\[\psi(X) = \sum_{i=1}^m\frac{(X_i-n_i)^2}{n_i} \; .\]

where \(n_i = np_i = \mathbb E[X_i]\) is the expected number of counts for color \(i\).

Sometimes written as:

\(\psi(X) = \sum_{i=1}^m\frac{(O_i-E_i)^2}{E_i} \; ,\)

where \(O_i\) stands for “observed count of \(i\)” and \(E_i\) stands for “expected count”
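The two forms of the statistic are identical; a small sketch with hypothetical counts (the numbers below are illustrative, not from the course), assuming SciPy's `chisquare`:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([18, 22, 30, 30])   # hypothetical counts O_i, n = 100
p = np.array([0.25, 0.25, 0.25, 0.25])  # H0 probabilities
expected = observed.sum() * p           # E_i = n * p_i

# Manual statistic vs scipy.stats.chisquare: they agree
psi = np.sum((observed - expected) ** 2 / expected)
stat, pvalue = chisquare(observed, f_exp=expected)
print(psi, stat)  # 4.32 4.32
```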

Property

Chi-squared approximation

When \(np_i\) are large, under \(H_0\): \[\psi(X) \xrightarrow{d} \chi^2(m-1)\]

Proof

\(X \sim \text{Mult}(n, p)\) has mean \(np\) and covariance \(n\Sigma\) where \(\Sigma = \text{diag}(p) - pp^\top\).

Indeed: \[\text{Cov}(X_i, X_j) = \begin{cases} np_i(1-p_i) & i = j \\ -np_ip_j & i \neq j \end{cases}\] which gives \(\text{Cov}(X) = n(\text{diag}(p) - pp^\top) = n\Sigma\).

By the multivariate CLT: \[\frac{X - np}{\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, \Sigma)\]

Let \(Z = \text{diag}(p)^{-1/2}(X - np)/\sqrt{n}\), so that \(\psi(X) = \|Z\|^2\). By the continuous mapping theorem:

\[Z \xrightarrow{d} \mathcal{N}(0,\, P), \quad P = I - \sqrt{p}\sqrt{p}^\top\] Since \(P\) is an orthogonal projection of rank \(m-1\), \(\mathcal{N}(0, P) \stackrel{d}{=} PG\) with \(G \sim \mathcal{N}(0, I)\), therefore: \[\psi(X) = \|Z\|^2 \xrightarrow{d} \|PG\|^2 \sim \chi^2(m-1)\]

Test and Rejection Region

Reject \(H_0\) if \(\psi(X) > t_{1-\alpha}\), the \((1-\alpha)\)-quantile of \(\chi^2(m-1)\).

We reject for large values of \(\psi\) (right-tailed test).

Question: We observe a \(\chi^2\) stat equal to \(32\) for 1000 rolls of a supposedly unbiased die. Is it normal?

If we observe a \(\chi^2\) stat equal to \(0.001\), what does it mean?

Example: Bag of Sweets

  • We observe a bag of sweets containing \(n=100\) sweets of \(m=3\) different colors: red, green, and yellow.
  • Manufacturer: \(p_1= 40\%\) red, \(p_2=35\%\) green, and \(p_3=25\%\) yellow.
  • \(H_0: q=p\) (manufacturer’s claim is correct)
  • \(H_1: q\neq p\) (manufacturer’s claim is incorrect)
Color    Observed Counts   Expected Counts
Red      \(X_1=50\)        \(n_1=40\)
Green    \(X_2=30\)        \(n_2=35\)
Yellow   \(X_3=20\)        \(n_3=25\)
  • \(\psi(X) = \sum_{i=1}^m\frac{(X_i-n_i)^2}{n_i} \approx 2.5 +0.71+1 \approx 4.21\)
  • \(\mathrm{cdf}(\chi^2(2), 4.21) \approx 0.878\) (\(p_{value} \approx 0.122\))
  • Conclusion: do not reject \(H_0\)
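The whole example is one call in code; a sketch assuming SciPy (`chisquare` uses \(m-1\) degrees of freedom by default):

```python
from scipy.stats import chisquare

observed = [50, 30, 20]   # red, green, yellow
expected = [40, 35, 25]   # n * p under the manufacturer's claim

stat, pvalue = chisquare(observed, f_exp=expected)
print(stat, pvalue)  # ≈ 4.214, ≈ 0.122: do not reject at level 5%
```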

Comparison to a Theoretical Distribution

Representation with Indicators

We draw independently and with replacement from \(m\) categories \((c_1, \dots, c_m)\). The probability of getting category \(c_k\) is \(p_k\).

Let \(Z_i\) be the category of the \(i^{th}\) draw

Then, if \(X_k= \sum_{i=1}^n \mathbf{1}\{Z_i = c_k\}\)

\((X_1, \dots, X_m) \sim Mult(n, (p_1, \dots, p_m))\)

Histograms with this Representation

We observe \((Z_1, \dots, Z_n) \in \mathbb R^n\)

Fix intervals \((I_1, \dots, I_m)\) that partition \(\mathbb R\)

Histogram

\(\mathrm{counts}(I) = \sum_{i=1}^n \mathbf 1\{Z_i \in I\} \in \{0, 1, \dots, n\}\;\)

\(\mathrm{freq}(I) = \mathrm{counts}(I)/n\)

\(\mathrm{hist}(a,b,m) = (\mathrm{counts}(I_1), \dots,\mathrm{counts}(I_m))\)

Histogram in Practice

Usually, we use balanced histograms on \([a,b)\):

\(I_l = \big[a + (l-1)\tfrac{b-a}{m},a + l\tfrac{b-a}{m}\big)\)

We can add \((-\infty, a)\) and \([b, +\infty)\) to get a partition of \(\mathbb R\)

Normalization

Can be normalized in counts (default), frequency, or density (area under the curve = 1)
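The three normalizations can be illustrated with NumPy's `histogram` (a sketch; the data and bin choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1000)

# Counts (default), frequencies, and density normalization
counts, edges = np.histogram(z, bins=10, range=(-3, 3))
freq = counts / len(z)
density, _ = np.histogram(z, bins=10, range=(-3, 3), density=True)

# With density=True, the area under the histogram is 1
area = np.sum(density * np.diff(edges))
print(area)  # 1.0 up to floating-point error
```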

We can similarly define histograms formally when \(X_i \in \mathbb R^p\) (even if there is no clean visual representation).

Illustration

Law of large numbers, Monte Carlo (informal)

Assume that \((X_1, \dots, X_n)\) are iid with distribution \(P\), and that \(a\), \(b\), \(m\) are fixed

The frequency histogram \(\mathrm{hist}(a,b,m)\) converges to the theoretical histogram of \(P\), whose bar heights are the probabilities \(P(I_j)\)

Convergence of the histogram (Proof)

Let \(I_j = \bigl[a + (j-1)\tfrac{b-a}{m},\ a + j\tfrac{b-a}{m}\bigr)\) for \(j=1,\dots,m\) be the bins. The height of the \(j\)-th bar is

\[\hat{h}_j = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in I_j}.\]

Since the \(X_i\) are iid, the indicators \(\mathbf{1}_{X_i \in I_j}\) are iid Bernoulli with mean \(P(I_j)\). By the strong LLN,

\[\hat{h}_j \xrightarrow{a.s.} P(I_j) \quad \text{as } n\to\infty.\]
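This convergence is easy to observe numerically; a sketch (assuming NumPy and SciPy) comparing empirical bin frequencies with the theoretical probabilities \(P(I_j)\) for a standard normal:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
edges = np.linspace(-3, 3, 11)     # 10 equal bins on [-3, 3)
p_theo = np.diff(norm.cdf(edges))  # P(I_j) under N(0, 1)

errors = {}
for n in (100, 10_000, 1_000_000):
    z = rng.normal(size=n)
    freq = np.histogram(z, bins=edges)[0] / n
    errors[n] = np.abs(freq - p_theo).max()
    print(n, errors[n])  # max deviation shrinks roughly like 1/sqrt(n)
```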

\(\chi^2\) Goodness of Fit to a given distribution

We observe \((X_1, \dots, X_n) \in \mathbb R^n\), iid with unknown distribution \(P\).

We want to test whether \(P\) is equal to a known distribution \(P_0\)

\(H_0\): \(P = P_0\) VS \(H_1\): \(P \neq P_0\)


Idea: partition \(\mathbb R\) and compare empirical VS theoretical histograms.

If \((I_1, \dots, I_m)\) are disjoint intervals that partition \(\mathbb R\),

we write \(p_1 = P_0(I_1), \dots, p_{m} = P_0(I_m)\) for the theoretical probabilities

Important

These are known because \(P_0\) is known and \(I_j\) are fixed by ourselves!!

We write \(C_j = \sum_{i=1}^n \mathbf{1}\{X_{i} \in I_j\}\).

\(C_j\) is the empirical number of times we get observations in \(I_j\).

By definition, under \(H_0\), \((C_1, \dots, C_m)\) follows a multinomial distribution \(\mathrm{Mult}(n, (p_1, \dots, p_{m}))\)

We want to compare \(C_j\) (empirical) against \(np_j\) (theoretical)

Chi-squared test statistic

We observe \(X=(X_1, \dots, X_n) \in \mathbb R^n\), and define \(C_j := C_j(X) = \sum_{i=1}^n \mathbf{1}(X_i \in I_j)\)

Chi-squared statistic:

\(\psi(X)=\sum_{j=1}^m \frac{(C_j - np_j)^2}{np_j}\)

If \(np_j\) are large enough for all \(j\) (say \(np_j \geq 15\)) then \(\psi(X) \asymp \chi^2(m-1)\)

Rejection Region: \([t_{1-\alpha}, +\infty)\) where \(t_{1-\alpha}\) is the \((1-\alpha)\)-quantile of \(\chi^2(m-1)\)

This is an asymptotic test
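Putting the pieces together, a sketch (assuming SciPy) of the full procedure for testing \(H_0 : P = \mathcal N(0,1)\); the bin edges below are an arbitrary choice, fixed before looking at the data:

```python
import numpy as np
from scipy.stats import norm, chisquare

rng = np.random.default_rng(2)
x = rng.normal(size=500)  # data, here actually drawn from P0

# Partition R into m = 5 intervals and compute p_j = P0(I_j)
edges = np.array([-np.inf, -1.0, -0.3, 0.3, 1.0, np.inf])
p = np.diff(norm.cdf(edges))

# Empirical counts C_j
counts = np.array([np.sum((x > a) & (x <= b))
                   for a, b in zip(edges[:-1], edges[1:])])

stat, pvalue = chisquare(counts, f_exp=len(x) * p)  # df = m - 1 = 4
print(stat, pvalue)
```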

Corrected \(\chi^2\) Test

If \(P_0\) is unknown, lying in a parametric family \(\mathcal{P} = \{P_\theta : \theta \in \Theta \subset \mathbb{R}^\ell\}\)

Estimate \(\theta\) by MLE \(\hat\theta\) from the data \(X_1,\dots,X_n\)

Replace theoretical probabilities by \(\hat p_j = P_{\hat\theta}(I_j)\)

Compute the corrected statistic:

\[\psi(X) = \sum_{j=1}^m \frac{(C_j - n\hat{p}_j)^2}{n\hat{p}_j}\]

Distribution of the Corrected \(\chi^2\)

Under \(H_0\), each estimated parameter costs one degree of freedom:

\(\psi(X) \asymp \chi^2(m - 1 - \ell)\)

Intuition: estimating \(\ell\) parameters from the data imposes \(\ell\) additional constraints on the counts \(C_j\), reducing the effective degrees of freedom from \(m-1\) to \(m-1-\ell\).

Example: Goodness of Fit to a Poisson distribution

\(H_0\): \(X_i\) iid \(\mathcal P(2)\)

X = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 4, 3, 0, 1, 1, 2, 3, 0, 1, 0, 0, 2, 1, 0, 1, 0, 0, 2, 0, 0]
Value                \(0\)    \(1\)    \(2\)    \(\geq 3\)   Total
Counts               16       8        3        3            30
Theoretical Counts   4.06     8.1      8.1      9.7
  • To get \(9.7\), we compute (1-cdf(Poisson(2),2))*30
  • chi square stat \(\gtrsim \frac{(16-4)^2}{4} = 36\)
  • (1-cdf(Chisq(3),36)) is very small: Reject
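The counts and expected counts in this example can be reproduced with SciPy (a sketch):

```python
import numpy as np
from scipy.stats import poisson, chi2

X = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 4, 3, 0, 1, 1, 2, 3, 0, 1, 0,
     0, 2, 1, 0, 1, 0, 0, 2, 0, 0]
n = len(X)

# Bins {0}, {1}, {2}, {>=3}
counts = np.array([X.count(0), X.count(1), X.count(2), sum(v >= 3 for v in X)])
p = np.array([poisson.pmf(k, 2) for k in (0, 1, 2)] + [poisson.sf(2, 2)])
expected = n * p

stat = np.sum((counts - expected) ** 2 / expected)
pvalue = chi2.sf(stat, df=3)  # m - 1 = 3 degrees of freedom
print(counts, np.round(expected, 2), stat, pvalue)
```

The full statistic is even larger than the lower bound above, and the p-value is tiny.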

Warning

  • If \(H_0\) is \(X_i\) iid \(\mathcal P(\lambda)\) with unknown \(\lambda\)
  • MLE: \(\hat \lambda = \bar X = 0.8\); then \(\psi(X) = \sum_{j=1}^4 \frac{(C_j - n\hat p_j)^2}{n\hat p_j} \approx 3.34\)
  • Not \(\chi^2(3)\) but \(\chi^2(2)\)
  • (1-cdf(Chisq(2),3.34)) \(\approx 0.19\): do not reject at level 5%
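Recomputing the corrected statistic for this data (a sketch using SciPy; note that \(n\hat p_4 \approx 1.4\) is far below the rule of thumb \(np_j \geq 15\), so the \(\chi^2\) approximation is fragile here):

```python
import numpy as np
from scipy.stats import poisson, chi2

X = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 4, 3, 0, 1, 1, 2, 3, 0, 1, 0,
     0, 2, 1, 0, 1, 0, 0, 2, 0, 0]
n = len(X)
lam_hat = np.mean(X)  # MLE of lambda: 0.8

# Bins {0}, {1}, {2}, {>=3}, with estimated probabilities
counts = np.array([X.count(0), X.count(1), X.count(2), sum(v >= 3 for v in X)])
p_hat = np.array([poisson.pmf(k, lam_hat) for k in (0, 1, 2)]
                 + [poisson.sf(2, lam_hat)])

stat = np.sum((counts - n * p_hat) ** 2 / (n * p_hat))
pvalue = chi2.sf(stat, df=2)  # m - 1 - 1: one estimated parameter
print(lam_hat, stat, pvalue)
```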


Comparison with QQ-Plots

We observe \((X_1, \dots, X_n) \in \mathbb R^n\) of unknown CDF \(F\)

We want to test whether \(F= F_0\):

\(H_0\): \(F = F_0\) VS \(H_1\): \(F \neq F_0\)

We write \(X_{(1)} \leq \dots \leq X_{(n)}\) for the ordered data

empirical \(\frac{k}{n}\)-quantile: \(X_{(k)}\)

\(\frac{k}{n}\)-quantile: \(x\) such that \(F_0(x) = \frac{k}{n}\)

Idea: Under \(H_0\), \(X_{(k)}\) should be approximately equal to the \(k/n\)-quantile of \(F_0\)

QQ-Plot

  • Represent the empirical quantiles in function of the theoretical quantiles.
  • Compare the scatter plot with \(y=x\)
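A sketch of the underlying computation (using the common plotting position \((k - 1/2)/n\) to avoid the infinite \(1\)-quantile; a real QQ-plot would scatter these two arrays):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.sort(rng.normal(size=200))  # ordered data: empirical quantiles X_(k)
k = np.arange(1, 201)
theo = norm.ppf((k - 0.5) / 200)   # theoretical quantiles of F0 = N(0, 1)

# Under H0 the scatter (theo, x) hugs the line y = x
corr = np.corrcoef(theo, x)[0, 1]
print(corr)
```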

Kolmogorov-Smirnov Test

  • We observe \((X_1, \dots, X_n)\) of unknown CDF \(F\)
  • \(H_0\): \(F = F_0\) where \(F_0\) is known
  • \(H_1\): \(F \neq F_0\)
  • We write \(X_{(1)} \leq \dots \leq X_{(n)}\) for the ordered data
  • empirical \(\frac{k}{n}\)-quantile: \(X_{(k)}\)
  • \(\frac{k}{n}\)-quantile: \(x\) such that \(F_0(x) = \frac{k}{n}\)
  • Empirical CDF: \(\hat F(x) = \frac{1}{n}\sum_{i=1}^n \mathbf 1\{X_i \leq x\}\)
  • Idea: Max distance between empirical and true CDF

Kolmogorov-Smirnov test

  • \(\psi(X) = \sup_{x}|\hat F(x) - F_0(x)|\)
  • Approx: \(\mathbb P_0(\psi(X) >c/\sqrt{n}) \to 2\sum_{r=1}^{+\infty}(-1)^{r-1}\exp(-2c^2r^2)\) when \(n \to +\infty\)
  • In practice, use Julia, Python or R for KS Tests
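In Python, the whole test is one call; a sketch with `scipy.stats.kstest`, whose second argument is the hypothesized CDF \(F_0\):

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(4)
x = rng.normal(size=300)

# H0 true: F0 = standard normal CDF
stat, pvalue = kstest(x, norm.cdf)
print(stat, pvalue)

# H0 false: F0 = N(1, 1); the statistic is much larger and the p-value tiny
stat_bad, pvalue_bad = kstest(x, norm(loc=1.0).cdf)
print(stat_bad, pvalue_bad)
```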
