Goodness of Fit Tests

Multinomials

Binomial distribution

  • Draw n balls, blue or red, with resampling
  • p_1, (1-p_1): proportion of blues/red [Wooclap]
  • X, Y: counts of blues/red
  • X \sim \mathrm{Bin}(n,p_1), Y=n-X \sim \mathrm{Bin}(n,1-p_1)
  • \mathbb P((X,Y) = (k_1,k_2)) = \binom{n}{k_1}p_1^k(1-p_1)^{k_2}
    if k_1 +k_2 = n

Multinomial distribution

  • Draw n balls, m potential colors, with resampling
  • (p_1, \dots, p_m): proportions of each color: \sum_{i=1}^m p_i = 1
  • X_1, \dots, X_m: count of each color [Wooclap]
  • (X_1, \dots, X_m) \sim \mathrm{Mult}(n,(p_1, \dots, p_m)) [Wooclap]
  • \mathbb P((X_1, \dots, X_m)=(k_1, \dots, k_m)) = \frac{n!}{k_1!\dots k_m!}p_1^{k_1} \dots p_m^{k_m}
    if k_1 + \dots + k_m = n
  • \frac{n!}{k_1!\dots k_m!} = \binom{n}{k_1,\dots, k_m} is a multinomial coefficient
  • In this course: n \gg m.
  • m \gg n corresponds to a high-dimensional setting.

\chi^2 Goodness of Fit

\chi^2 Goodness of Fit Test

  • We observe (X_1, \dots, X_m) \sim \mathrm{Mult}(n, q). This corresponds to n counts: X_1 + \dots + X_m = n
  • q = (q_1, \dots, q_m) corresponds to probabilities of getting color 1, \dots, m
  • Let p = (p_1, \dots, p_m) be a known vector s.t. p_1 + \dots + p_m = 1.
  • H_0:~ q = p ~~~\text{or}~~~ H_1: q \neq p \; .

\chi^2 Goodness of Fit Test (Adéquation)

  • Chi-squared test statistic:: \psi(X) = \sum_{i=1}^m\frac{(X_i-n_i)^2}{n_i} \; , where n_i = np_i = \mathbb E[X_i] is the expected number of counts for color i.
  • When np_i = n_i are large, under H_0, we approximate the distribution of \psi(X) as \psi(X) \sim \chi^2(m-1)
  • Reject if \psi(X) > t_{1- \alpha} (right-tail of \chi^2(m-1))

Example: Bag of Sweets

  • We observe a bag of sweets containing n=100 sweets of m=3 different colors: red, green, and yellow.
  • Manufacturer: p_1= 40\% red, p_2=35\% green, and p_3=25\% yellow.
  • H_0: q=p (manufacturer’s claim is correct)
  • H_1: q\neq p (manufacturer’s claim is incorrect)
Color Observed Counts
Red X_1=50
Green X_2=30
Yellow X_3=20
Expected Counts
n_1=40
n_2=35
n_3=25
  • \psi(X) = \sum_{i=1}^m\frac{(X_i-n_i)^2}{n_i} \approx 2.5 +0.71+1 \approx 4.21
  • \mathrm{cdf}(\chi^2(2), 4.21) \approx 0.878 (p_{value} \approx 0.222)
  • Conclusion: do not reject H_0

Comparison to a Theoretical Disrtribution

Histograms

histogram

  • We observe (X_1, \dots, X_n) \in \mathbb R^n
  • \mathrm{counts}(I) = \sum \mathbf 1\{X_i \in I\} \in \{1, \dots, n\}\; .
  • \mathrm{freq}(I) = \mathrm{counts}(I)/n
  • \mathrm{hist}(a,b,k) = (\mathrm{counts}(I_1), \dots,\mathrm{counts}(I_k))
  • where I_l = \big[a + (l-1)\tfrac{b-a}{k},a + l\tfrac{b-a}{k}\big)

Normalization

Can be normalized in counts (default), frequency, or density (area under the curve = 1)

Law of large numbers, Monte Carlo (informal)

  • Assume that (X_1, \dots, X_n) are iid of distrib P, and that a,b, k are fixed
  • The histogram \mathrm{hist}(a,b,k) converges to the histogram of the density P

\chi^2 Goodness of Fit to a given distribution

  • We observe (X_1, \dots, X_n) \in \mathbb R^n, iid with unknown distrib P
  • H_0: P = P_0, where P_0 is known
  • H_1: P \neq P_0 [Wooclap]
  • Idea: under H_0, the counts in disjoint intervals I_1, I_2, \dots, I_k follow a multinomial distribution \mathrm{Mult}(n, (p_1, \dots, p_{k})) where p_1 = P_0(I_1), \dots, p_{k} = P_0(I_k)

Reduction to chi-squared test statistic

  1. Count the number of data (c_1, \dots, c_k) in I_1, \dots, I_k
  2. Theoretical counts nP_0(I_1) \dots, nP_0(I_k)
  3. Chi-squared statistic: \sum_{j=1}^k \frac{(c_j - nP_0(I_j))^2}{nP_0(I_j)}
  4. Decide using an 1-\alpha-quantile of a \chi^2(k-1) distribution

Corrected \chi^2 Test

  1. If P_0 is unknown
  2. In a class \mathcal P parameterized by \ell parameters
  3. Then estimate the \ell parameters
  4. Compute theoretical counts with \hat P_{\hat \theta}
  5. But decide with \chi^2(k - 1 - \ell)

Example: Goodness of Fit to a Poisson distribution

H_0: X_i iid \mathcal P(2)

X = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 4, 3, 0, 1, 1, 2, 3, 0, 1, 0, 0, 2, 1, 0, 1, 0, 0, 2, 0, 0]
0 1 2  \geq 3 Total
Counts 16 8 3 3 30
Theoretical Counts 4.06 8.1 8.1 9.7
  • To get 9.7, we compute (1-cdf(Poisson(2),2))*30
  • chi square stat \gtrsim \frac{(16-4)^2}{4} = 36
  • (1-cdf(Chisq(3),36)) is very small: Reject

Warning

  • If H_0 is X_i iid \mathcal P(\lambda) with unknown \lambda
  • \hat \lambda = 0.8, \sum_{i=1}^4 \frac{(X_i - n_i)^2}{n_i} \approx 9.4
  • Not \chi^2(3) but \chi^2(2)
  • (1-cdf(Chisq(2),9.4)) \approx 0.009. Reject at level 1%

Comparison with QQ-Plots

  • We observe (X_1, \dots, X_n) of unknown CDF F
  • H_0: F = F_0 where F_0 is known
  • H_1: F \neq F_0
  • We write X_{(1)} \leq \dots \leq X_{(n)} for the ordered data
  • empirical \frac{k}{n}-quantile: X_{(k)} [Wooclap]
  • \frac{k}{n}-quantile: x such that F_0(x) = \frac{k}{n}

QQ-Plot

  • Represent the empirical quantiles in function of the theoretical quantiles.
  • Compare the scatter plot with y=x

Kolmogorov-Smirnov Test

  • We observe (X_1, \dots, X_n) of unknown CDF F
  • H_0: F = F_0 where F_0 is known
  • H_1: F \neq F_0
  • We write X_{(1)} \leq \dots \leq X_{(n)} for the ordered data
  • empirical \frac{k}{n}-quantile:
  • \frac{k}{n}-quantile: x such that F_0(x) = \frac{k}{n} [Wooclap]
  • Empirical CDF: \hat F(x) = \frac{1}{n}\sum_{i=1}^n \mathbf 1\{X_i \leq x\}
  • Idea: Max distance between empirical and true CDF

Kolmogorov-Smirnov test

  • \psi(X) = \sup_{x}|\hat F(x) - F_0(x)|
  • Approx: \mathbb P_0(\psi(X) >c/\sqrt{n}) \to 2\sum_{r=1}^{+\infty}(-1)^{r-1}\exp(-2c^2r^2) when n \to +\infty
  • In practice, use Julia, Python or R for KS Tests