Gaussian Populations

One Gaussian Population

Testing Mean with Known Variance

We observe X = (X_1, \dots, X_n), iid with distribution \mathcal N(\mu, \sigma). We assume that \mu is unknown but that \sigma is known.

Test problems

\begin{aligned} H_0: \mu = \mu_0 ~~~~ &\text{ or } ~~~ H_1: \mu > \mu_0 ~~~ \text{(right-tailed)}\\ H_0: \mu = \mu_0 ~~~ &\text{ or } ~~~ H_1: \mu < \mu_0 ~~~ \text{(left-tailed)}\\ H_0: \mu = \mu_0 ~~~ &\text{ or } ~~~ H_1: \mu \neq \mu_0 ~~~ \text{(two-tailed)}\\ \end{aligned}

  • Test Statistic: \psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} \; .
  • Under H_0, \psi(X) \sim \mathcal N(0,1)
Tests

\begin{aligned} \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} > t_{1-\alpha} ~~~ \text{(right-tailed)}\\ \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} < t_{\alpha} ~~~ \text{( left-tailed)}\\ \left|\frac{\sqrt{n}(\overline X-\mu_0)}{\sigma}\right| > t_{1-\tfrac{\alpha}{2}}~~~ \text{(two-tailed)}\\ \end{aligned}
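
As a concrete illustration, here is a minimal sketch of these three tests in Julia, assuming the Distributions.jl package (whose quantile/cdf syntax appears later in these notes); the function ztest and all variable names are illustrative, not part of the notes.

```julia
# Minimal sketch of the one-sample z-test (σ known), assuming Distributions.jl.
using Distributions, Statistics

function ztest(x, mu0, sigma; alpha = 0.05, tail = :two)
    n = length(x)
    psi = sqrt(n) * (mean(x) - mu0) / sigma          # test statistic ψ(X)
    if tail == :right
        return psi > quantile(Normal(), 1 - alpha)
    elseif tail == :left
        return psi < quantile(Normal(), alpha)
    else                                             # two-tailed
        return abs(psi) > quantile(Normal(), 1 - alpha / 2)
    end
end

x = rand(Normal(0.3, 1.0), 50)    # simulated sample with true mean 0.3
ztest(x, 0.0, 1.0)                # H0: μ = 0 vs H1: μ ≠ 0, with σ = 1 known
```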

Why 0.05 and 1.96?

Fisher’s Quote

The value for which p=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not.

Testing Mean with Unknown Variance

  • Multiple vs. multiple test problem: H_0: \{\mu = \mu_0,\sigma > 0\} \text{ or } H_1: \{\mu \neq \mu_0,\sigma > 0\} \;.

  • \psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} is no longer a test statistic: it depends on the unknown \sigma.

  • Idea: replace \sigma by its estimator \hat \sigma(X) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i - \overline X)^2} \; .

  • This gives \psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\hat \sigma} \; .

  • Is \psi(X) pivotal under H_0? What is its distribution?

Chi-Square and Student Distributions

Chi-Squared Distribution \chi^2(k)
  • k: degree of freedom
  • Distribution of \sum_{i=1}^k Z_i^2, where the Z_i’s are iid \mathcal N(0,1).
  • \mathrm{Var}(Z_i^2) = \mathbb E[Z_i^4] - \mathbb E[Z_i^2]^2 = 3 - 1 = 2
  • \chi^2(k) \approx k + \sqrt{2k}\,\mathcal N(0,1) when k \to +\infty
Student distribution \mathcal T(k)
  • k: degree of freedom
  • Distribution of \tfrac{Z}{\sqrt{U/k}}, where Z and U are independent and follow, respectively, \mathcal N(0,1) and \chi^2(k).
Theorem

Assume the X_i are iid \mathcal N(\mu_0, \sigma), i.e., H_0 holds.

  • The test statistic \psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\hat \sigma} is pivotal: its distribution does not depend on \sigma.
  • It follows a Student distribution \mathcal T(n-1).

Sketch of Proof.

Remark that the orthogonal projection of (X_1, \dots, X_n) onto the line spanned by (1, \dots, 1) is equal to \overline X \cdot (1, \dots, 1); this is precisely because the empirical mean minimizes a \mapsto \tfrac{1}{n}\sum (X_i - a)^2. Hence, the two vectors \overline X \cdot (1, \dots, 1) and (X_1 - \overline X, \dots, X_n - \overline X) are orthogonal. By Cochran’s theorem, orthogonality is equivalent to independence for Gaussian vectors, so \overline X is independent of \hat \sigma^2, and moreover (n-1)\hat \sigma^2/\sigma^2 \sim \chi^2(n-1). Writing \psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)/\sigma}{\sqrt{\hat \sigma^2/\sigma^2}}, the numerator is \mathcal N(0,1), the denominator is \sqrt{\chi^2(n-1)/(n-1)}, and the two are independent: by definition, \psi(X) \sim \mathcal T(n-1). \tag*{$\blacksquare$}

The Student tests are the same as the Gaussian tests with known variance, except that the quantiles of the Gaussian distribution are replaced with those of the Student distribution.

  • The quantiles of the Student distribution are close to those of the standard Gaussian when the number of degrees of freedom k is large, as the sketch below illustrates.
  • The quantiles are slightly larger when k is small: the Student distribution has slightly heavier tails.
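
A quick numerical check, assuming Distributions.jl (the same quantile/cdf syntax used elsewhere in these notes):

```julia
# Student 97.5% quantiles approach the Gaussian quantile as k grows.
using Distributions

for k in (2, 5, 10, 30, 100)
    println("k = ", k, ":  ", quantile(TDist(k), 0.975))
end
println("Gaussian: ", quantile(Normal(), 0.975))   # ≈ 1.96
```

For k = 2 the quantile is about 4.30; by k = 30 it is already about 2.04.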

Student T-test

  • Multiple vs. multiple test problem X=(X_1, \dots, X_n): H_0: \{\mu = \mu_0,\sigma > 0\} \text{ or } H_1: \{\mu \neq \mu_0,\sigma > 0\} \;.

  • (Student) T-test statistic: \psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\hat \sigma(X)} \sim \mathcal T(n-1) \text{ under } H_0
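
A minimal sketch of the T-test, under the same assumptions as the z-test sketch above (Distributions.jl; ttest is an illustrative name). Note that Julia’s std already uses the 1/(n-1) normalization of \hat\sigma:

```julia
# One-sample Student t-test sketch (two-tailed).
using Distributions, Statistics

function ttest(x, mu0; alpha = 0.05)
    n = length(x)
    psi = sqrt(n) * (mean(x) - mu0) / std(x)        # std divides by n-1
    pvalue = 2 * ccdf(TDist(n - 1), abs(psi))       # two-tailed p-value
    reject = abs(psi) > quantile(TDist(n - 1), 1 - alpha / 2)
    return (psi = psi, reject = reject, pvalue = pvalue)
end
```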

Testing Variance, Unknown Mean

  • We observe X=(X_1, \dots, X_n) iid \mathcal N(\mu, \sigma); \mu and \sigma are unknown, \sigma_0 > 0 is fixed.
Right-tailed test
  • H_0: \sigma \leq \sigma_0, H_1: \sigma > \sigma_0
  • \psi(X) = \frac{1}{\sigma_0^2}\sum (X_i - \overline X)^2
  • T(X) = \mathbf{1}\{\psi(X) > q_{1-\alpha}\}
  • q_{1-\alpha}: quantile(Chisq(n-1), 1-alpha)
  • pvalue: 1-cdf(Chisq(n-1), xobs)

Left-tailed test
  • H_0: \sigma \geq \sigma_0, H_1: \sigma < \sigma_0
  • \psi(X) = \frac{1}{\sigma_0^2}\sum (X_i - \overline X)^2
  • T(X) = \mathbf{1}\{\psi(X) < q_{\alpha}\}
  • q_{\alpha}: quantile(Chisq(n-1), alpha)
  • pvalue: cdf(Chisq(n-1), xobs)
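
Both tails fit in one sketch (vartest is an illustrative name, assuming Distributions.jl):

```julia
# χ² test on the variance, right- or left-tailed.
using Distributions, Statistics

function vartest(x, sigma0; alpha = 0.05, tail = :right)
    n = length(x)
    psi = sum((x .- mean(x)) .^ 2) / sigma0^2   # ψ(X) ~ χ²(n-1) when σ = σ0
    d = Chisq(n - 1)
    if tail == :right
        return (reject = psi > quantile(d, 1 - alpha), pvalue = ccdf(d, psi))
    else                                        # left-tailed
        return (reject = psi < quantile(d, alpha), pvalue = cdf(d, psi))
    end
end
```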

Two Gaussian Populations

Testing Means, Known Variances

  • We observe (X_1, \dots, X_{n_1}) iid \mathcal N(\mu_1, \sigma_1) and (Y_1, \dots, Y_{n_2}) iid \mathcal N(\mu_2, \sigma_2).

  • \sigma_1, \sigma_2 are known, \mu_1, \mu_2 are unknown

  • Test Problem: H_0: \mu_1 = \mu_2 ~~~\text{or} ~~~H_1: \mu_1 \neq \mu_2

  • Idea: Normalize \overline X - \overline Y: \psi(X,Y)=\frac{\overline X - \overline Y}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim \mathcal N(0,1) \text{ under } H_0

  • Two-tailed test for equality of the means: T(X,Y)=\mathbf{1}\left\{\left|\frac{\overline X - \overline Y}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\right| \geq t_{1-\alpha/2}\right\} \; ,

  • t_{1-\alpha/2} is the (1-\alpha/2)-quantile of the standard Gaussian distribution
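
A sketch of this two-sample z-test (ztest2 is an illustrative name, assuming Distributions.jl):

```julia
# Two-sample z-test for equality of means, σ1 and σ2 known.
using Distributions, Statistics

function ztest2(x, y, sigma1, sigma2; alpha = 0.05)
    psi = (mean(x) - mean(y)) /
          sqrt(sigma1^2 / length(x) + sigma2^2 / length(y))
    return abs(psi) >= quantile(Normal(), 1 - alpha / 2)   # reject H0?
end
```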

Testing Variances, Unknown Means

  • We observe (X_1, \dots, X_{n_1}) iid \mathcal N(\mu_1, \sigma_1) and (Y_1, \dots, Y_{n_2}) iid \mathcal N(\mu_2, \sigma_2).
  • \sigma_1, \sigma_2, \mu_1, \mu_2 are unknown
  • Variance Testing Problem: H_0: \sigma_1 = \sigma_2 ~~~~ \text{ or } ~~~~ H_1: \sigma_1 \neq \sigma_2
  • F-Test Statistic of the Variances (ANOVA) \frac{\hat \sigma^2_1}{\hat \sigma_2^2} = \frac{\tfrac{1}{n_1-1}\sum_{i=1}^{n_1}(X_i-\overline X)^2}{\tfrac{1}{n_2-1}\sum_{i=1}^{n_2}(Y_i-\overline Y)^2}\; .

Fisher Distribution

Fisher Distribution \mathcal F(k_1,k_2)
  • (k_1, k_2): degrees of freedom
  • Distribution of \frac{U_1/k_1}{U_2/k_2}
  • where U_1 and U_2 are independent and follow \chi^2(k_1) and \chi^2(k_2) respectively
  • \mathcal F(k_1,k_2) \approx 1 + \sqrt{\frac{2}{k_1} + \frac{2}{k_2}}\mathcal N\left(0, 1\right) when k_1,k_2 \to +\infty
  • Example: \frac{Z_1^2+Z_2^2}{2Z_3^2} \sim \mathcal F(2,1) if Z_i \sim \mathcal N(0,1)
Proposition
  • Under H_0, \psi(X,Y)=\frac{\hat \sigma^2_1}{\hat \sigma_2^2} is independent of \mu_1, \mu_2 and of the common variance: it is pivotal.
  • It follows the Fisher distribution \mathcal F(n_1-1, n_2-1).

  • Two-tailed test: reject when \frac{\hat \sigma^2_1}{\hat \sigma_2^2} \not \in [t_{\alpha/2}, t_{1-\alpha/2}], where the t’s are quantiles of \mathcal F(n_1-1, n_2-1)
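
A sketch of the two-tailed F-test (ftest is an illustrative name, assuming Distributions.jl; Julia’s var uses the 1/(n-1) normalization):

```julia
# Two-tailed F-test for equality of variances.
using Distributions, Statistics

function ftest(x, y; alpha = 0.05)
    d = FDist(length(x) - 1, length(y) - 1)
    psi = var(x) / var(y)                      # F-statistic
    return psi < quantile(d, alpha / 2) || psi > quantile(d, 1 - alpha / 2)
end
```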

Testing Means, Equal Variances

  • We observe (X_1, \dots, X_{n_1}) iid \mathcal N(\mu_1, \sigma_1) and (Y_1, \dots, Y_{n_2}) iid \mathcal N(\mu_2, \sigma_2).
  • \sigma_1, \sigma_2, \mu_1, \mu_2 are unknown, but we know that \sigma_1=\sigma_2
  • Equality of Mean Testing Problem: H_0: \mu_1 = \mu_2 ~~~~ \text{ or } ~~~~ H_1: \mu_1 \neq \mu_2
  • Formally, H_0 = \{(\mu,\sigma, \mu, \sigma), \mu \in \mathbb R, \sigma > 0\}.
Student T-Test for two populations with equal variance
  • \hat \sigma^2 = \frac{1}{n_1 + n_2 - 2}\left(\sum_{i=1}^{n_1}(X_i - \overline X)^2 + \sum_{i=1}^{n_2}(Y_i - \overline Y)^2 \right)
  • Normalize \overline X - \overline Y:

\psi(X,Y) = \frac{\overline X - \overline Y}{\sqrt{\hat \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim \mathcal T(n_1+n_2 - 2) ~~~ \text{under } H_0 \; .

  • \psi(X,Y) is pivotal because \sigma_1 = \sigma_2.
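
A sketch of the pooled two-sample T-test (ttest2 is an illustrative name, assuming Distributions.jl):

```julia
# Two-sample t-test assuming σ1 = σ2 (pooled variance estimator).
using Distributions, Statistics

function ttest2(x, y; alpha = 0.05)
    n1, n2 = length(x), length(y)
    s2 = (sum((x .- mean(x)) .^ 2) + sum((y .- mean(y)) .^ 2)) / (n1 + n2 - 2)
    psi = (mean(x) - mean(y)) / sqrt(s2 * (1 / n1 + 1 / n2))
    return abs(psi) > quantile(TDist(n1 + n2 - 2), 1 - alpha / 2)
end
```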

Testing Means, Unequal Variances

  • We observe (X_1, \dots, X_{n_1}) iid \mathcal N(\mu_1, \sigma_1) and (Y_1, \dots, Y_{n_2}) iid \mathcal N(\mu_2, \sigma_2).
  • \sigma_1, \sigma_2, \mu_1, \mu_2 are unknown
  • Equality of Mean Testing Problem: H_0: \mu_1 = \mu_2 ~~~~ \text{ or } ~~~~ H_1: \mu_1 \neq \mu_2
  • Formally, H_0 = \{(\mu,\sigma_1, \mu, \sigma_2), \mu \in \mathbb R, \sigma_1, \sigma_2 > 0\}.
Student Welch Test Statistic

\psi(X, Y) = \frac{\overline X - \overline Y}{\sqrt{\frac{\hat \sigma_1^2}{n_1} + \frac{\hat \sigma_2^2}{n_2}}}

  • \psi(X,Y) is not pivotal
  • Gaussian approximation: \psi(X,Y) \approx \mathcal N(0,1) when n_1, n_2 \to \infty
  • Better approximation: the Welch approach, which approximates the distribution of \psi(X,Y) by a Student distribution \mathcal T(\hat \nu) with degrees of freedom \hat \nu estimated from the data, as sketched below
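
The Welch–Satterthwaite estimate of the degrees of freedom is \hat \nu = \frac{\left(\hat\sigma_1^2/n_1 + \hat\sigma_2^2/n_2\right)^2}{\frac{(\hat\sigma_1^2/n_1)^2}{n_1-1} + \frac{(\hat\sigma_2^2/n_2)^2}{n_2-1}}. A minimal sketch (welch is an illustrative name, assuming Distributions.jl, whose TDist accepts non-integer degrees of freedom):

```julia
# Welch's t-test with Welch–Satterthwaite degrees of freedom.
using Distributions, Statistics

function welch(x, y; alpha = 0.05)
    n1, n2 = length(x), length(y)
    v1, v2 = var(x) / n1, var(y) / n2
    psi = (mean(x) - mean(y)) / sqrt(v1 + v2)
    nu = (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))   # estimated d.o.f.
    return abs(psi) > quantile(TDist(nu), 1 - alpha / 2)
end
```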

Asymptotic Approximations

Central Limit Theorem

CLT
  • Let S_n = \sum_{i=1}^n X_i with (X_1, \dots, X_n) iid and square-integrable (L^2); then \frac{S_n - \mathbb E[S_n]}{\sqrt{\mathrm{Var}(S_n)}} \approx \mathcal N(0,1) \text{ when $n \to \infty$}
  • The approximation is an equality when the X_i’s are \mathcal N(\mu, \sigma)
  • Rule of thumb: n \geq 30
Example: Binomials
  • If p \in (0,1)
  • \frac{\mathrm{Bin}(n,p) - np}{\sqrt{np(1-p)}} \approx \mathcal N(0,1) when n \to \infty
  • n should be \gg \frac{1}{p}

(Figure: the Gaussian approximation to \mathrm{Bin}(n,p) is good for (n=100, p=0.2) but poor for (n=100, p=0.01).)
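
A numerical version of this comparison (a sketch, assuming Distributions.jl), looking at the upper tail one standard deviation above the mean:

```julia
# Binomial tail vs. its Gaussian approximation.
using Distributions

for (n, p) in ((100, 0.2), (100, 0.01))
    m, s = n * p, sqrt(n * p * (1 - p))
    println("n = ", n, ", p = ", p,
            ": Binomial tail = ", ccdf(Binomial(n, p), m + s),
            ", Gaussian tail = ", ccdf(Normal(m, s), m + s))
end
```

The Gaussian tail is ≈ 0.159 in both cases; the Binomial tail is much closer to it for p = 0.2 than for p = 0.01.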

Proportion Test

  • We observe X \sim Bin(n_1, p_1) and Y \sim Bin(n_2, p_2).
  • n_1, n_2 are known but p_1, p_2 are unknown in (0,1)
  • H_0: p_1 = p_2 or H_1: p_1 \neq p_2
Test Statistic

\psi(X,Y) = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p ( 1-\hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \; .

  • \hat p_1 = X/n_1, \hat p_2 = Y/n_2
  • \hat p = \frac{X+Y}{n_1+n_2}
  • If n_1 p_1, n_2 p_2 \gg 1: \psi(X,Y) \approx \mathcal N(0,1) under H_0
  • We reject if |\psi(X,Y)| \geq t_{1-\alpha/2} (Gaussian quantile); see the sketch below
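
A sketch of this test (proptest is an illustrative name, assuming Distributions.jl):

```julia
# Two-sample proportion test with pooled variance estimate.
using Distributions

function proptest(x, n1, y, n2; alpha = 0.05)
    p1, p2 = x / n1, y / n2
    p = (x + y) / (n1 + n2)                    # pooled estimate under H0
    psi = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (psi = psi, reject = abs(psi) >= quantile(Normal(), 1 - alpha / 2))
end
```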

Example

  • Question: “Should we raise taxes on cigarettes to pay for a healthcare reform?”
  • p_1, p_2: proportions of non-smokers and smokers, respectively, willing to raise taxes
  • H_0: p_1=p_2 or H_1: p_1 > p_2
        Non-Smokers   Smokers   Total
YES             351        41     392
NO              254       154     408
Total           605       195     800
  • \hat p_1 \approx 0.58, \hat p_2 \approx 0.21.
  • \psi(X,Y)= \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p ( 1-\hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \approx 8.99
  • p-value: \mathbb P(\psi(X,Y) > 8.99) = 1-cdf(Normal(0,1), 8.99) \approx 10^{-19}, so H_0 is rejected at any usual level
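
The example can be reproduced directly (a right-tailed version matching H_1: p_1 > p_2, with the numbers from the table above; assuming Distributions.jl):

```julia
# Reproducing the cigarette-tax example.
using Distributions

x, n1 = 351, 605                  # non-smokers answering YES
y, n2 = 41, 195                   # smokers answering YES
p1, p2 = x / n1, y / n2           # ≈ 0.580 and ≈ 0.210
p = (x + y) / (n1 + n2)           # pooled proportion, 392/800 = 0.49
psi = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # ≈ 8.99
pvalue = ccdf(Normal(), psi)      # ≈ 1e-19: reject H0
```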