Gaussian Populations

Outline

  • One Gaussian population (one mean, one variance):
    • Testing mean with known variance
    • Testing mean with unknown variance
    • Testing variance with unknown mean
  • Two Gaussian populations (two means, two variances)
    • Testing difference of means with known variances
    • Testing difference of means with known variances

Previous: usual distributions

Next: goodness of fit

One Gaussian Population

Testing Mean with Known Variance

\(X = (X_1, \dots, X_n)\), iid with distribution \(\mathcal N(\mu, \sigma^2)\).

Test Problems

\(H_0: \mu = \mu_0 ~~~~ \text{ or } ~~~ H_1: \mu > \mu_0\) (right-tailed)

\(H_0: \mu = \mu_0 ~~~ \text{ or } ~~~ H_1: \mu < \mu_0\) (left-tailed)

\(H_0: \mu = \mu_0 ~~~ \text{ or } ~~~ H_1: \mu \neq \mu_0\) (two-tailed)

  • We want to test the mean \(\mu = \mu_0\). Natural idea: use \(\overline X\)
  • But \(\overline X \sim \mathcal N(\mu_0, \frac{\sigma^2}{n})\) under \(H_0\). So we normalize to get an \(\mathcal N(0,1)\)

Test Statistic

Test statistic:

\[\psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma}\]

Under \(H_0\), \(\psi(X) \sim \mathcal N(0,1)\)

Warning

This is a test statistic because \(\mu_0\) and \(\sigma\) are known here. It wouldn’t be otherwise.

Tests

Test Critical Regions

\(\mathcal R\): \(\frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} > t_{1-\alpha}\) (right-tailed)

\(\mathcal R\): \(\frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} < t_{\alpha}\) (left-tailed)

\(\mathcal R\): \(\left|\frac{\sqrt{n}(\overline X-\mu_0)}{\sigma}\right| > t_{1-\tfrac{\alpha}{2}}\) (two-tailed)

Illustration

Example

A machine fills bottles with a nominal volume \(\mu_0 = 500\) ml. The filling volume is known to follow \(\mathcal{N}(\mu, \sigma^2)\) with \(\sigma = 5\) ml. On a sample of \(n = 25\) bottles, we observe \(\overline{x} = 498.1\) ml. Is the machine under-filling?

  • \(H_0: \mu = 500\) vs \(H_1: \mu < 500\) (left-tailed)
  • Test statistic: \(\tfrac{\sqrt{n}(\overline X - \mu_0)}{\sigma} = \tfrac{\sqrt{25}(498.1 - 500)}{5} = -1.9\)
  • Threshold: \(t_{\alpha} =\) quantile(Normal(0,1), 0.05) = -1.645
  • \(-1.9 < -1.645\): reject \(H_0\) at level \(5\%\)
  • p-value: cdf(Normal(0,1), -1.9) = 0.029
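These numbers can be reproduced with a short script. The notes use Julia-style `quantile`/`cdf` calls; here is a Python sketch where `statistics.NormalDist` plays the same role:

```python
from math import sqrt
from statistics import NormalDist

# Known-variance z-test for the bottle-filling example
n, mu0, sigma, xbar, alpha = 25, 500.0, 5.0, 498.1, 0.05

psi = sqrt(n) * (xbar - mu0) / sigma      # observed test statistic
t_alpha = NormalDist().inv_cdf(alpha)     # left-tailed threshold t_alpha
p_value = NormalDist().cdf(psi)           # P(N(0,1) <= psi)

print(round(psi, 3), round(t_alpha, 3), round(p_value, 3))
# psi = -1.9 < -1.645, so H0 is rejected at level 5%
```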

Testing Mean with Unknown Variance

We observe \((X_1, \dots, X_n)\) iid \(\mathcal N(\mu, \sigma^2)\) where \(\mu\) and \(\sigma\) are unknown.

We fix \(\mu_0\) as a known quantity and we want to test whether \(\mu = \mu_0\).

Composite VS composite test problem:

\[ H_0: \{\mu = \mu_0,\ \sigma > 0\} \text{ or } H_1: \{\mu \neq \mu_0,\ \sigma > 0\} \;. \]

Warning

\(\psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma}\) is no longer a test statistic, because \(\sigma\) is unknown.

Idea: replace \(\sigma\) by its estimator \[ \hat \sigma(X) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i - \overline X)^2} \; .\]

Student T-Test

\[ H_0: \{\mu = \mu_0,\ \sigma > 0\} \text{ or } H_1: \{\mu \neq \mu_0,\ \sigma > 0\} \;. \]

(Student) T-test statistic: \[T(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\hat \sigma(X)}\]

Proposition: distribution of T under \(H_0\)

\(T(X)\) is a pivotal test statistic.

Under \(H_0\), \(T(X)\sim \mathcal T(n-1)\)

Proof

Take \(E = \operatorname{Span}(\mathbf{1})\) and \(F = E^\perp\). Setting \(Y_i = \frac{X_i - \mu}{\sigma}\):

\[ \|\Pi_E Y\|^2 = n\overline{Y}^2 = \frac{n(\overline{X}-\mu)^2}{\sigma^2} \sim \chi^2(1) \]

\[ \|\Pi_F Y\|^2 = \sum_{i=1}^n (Y_i - \overline{Y})^2 = \frac{(n-1)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-1) \]

and the two are independent, which is precisely what is needed for the Student \(T\)-statistic.
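A quick simulation (a Python sketch, not part of the notes) illustrates pivotality: under \(H_0\) the distribution of \(T(X)\) does not depend on \((\mu_0, \sigma)\), so the rejection rate at the t-quantile \(t_{0.95, 9} \approx 1.833\) stays near \(5\%\) for any parameter values:

```python
import random
from math import sqrt

def t_stat(xs, mu0):
    """Student T statistic sqrt(n)(xbar - mu0)/sigma_hat."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return sqrt(n) * (xbar - mu0) / sqrt(s2)

def rejection_rate(mu0, sigma, n=10, reps=20000, q=1.833):
    # q ~ 0.95-quantile of T(9), taken from a t table
    rej = sum(
        1
        for _ in range(reps)
        if t_stat([random.gauss(mu0, sigma) for _ in range(n)], mu0) > q
    )
    return rej / reps

random.seed(0)
r1 = rejection_rate(mu0=0.0, sigma=1.0)    # standard parameters
r2 = rejection_rate(mu0=37.0, sigma=12.0)  # very different parameters
print(r1, r2)  # both close to 0.05: T is pivotal
```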

Student Distribution

Testing Variance, Unknown Mean

We observe \(X=(X_1, \dots, X_{n})\) iid \(\mathcal N(\mu, \sigma^2)\). \(\mu\) and \(\sigma\) are unknown. \(\sigma_0\) is fixed and known.

We want to test if \(\sigma > \sigma_0\), or \(\sigma < \sigma_0\)

Right-Tailed Fisher

\(H_0\): \(\sigma \leq \sigma_0\), \(H_1\): \(\sigma > \sigma_0\)

Test statistic:

\(\psi(X) = \frac{1}{\sigma_0^2}\sum_{i=1}^n (X_i - \overline X)^2\)

Test:

\(T(X) = \mathbf{1}\{\psi(X) > q_{1-\alpha}\}\) with \(q_{1-\alpha}\) the \((1-\alpha)\)-quantile of \(\chi^2(n-1)\)

Rejection Region: \([q_{1-\alpha}, +\infty)\)

Critical Region: \(\{(x_1, \dots, x_n) \in \mathbb R^n: ~ \psi(x_1, \dots, x_n) > q_{1-\alpha}\}\)

Property of Fisher Test

Proposition

Fix \(t>0\). Under \(H_0\), that is if \(\sigma \leq \sigma_0\)

\[P_{\mu, \sigma}(\psi(X) > t) \leq P_{\mu, \sigma_0}(\psi(X) > t) = P(\chi^2(n-1) > t)\]


In practice:

\(q_{1-\alpha}\): quantile(Chisq(n-1), 1-alpha)

pvalue: 1-cdf(Chisq(n-1), xobs)
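Without a statistics library, the \(\chi^2\) quantile can be approximated by the Wilson–Hilferty formula. A Python sketch (the notes' `quantile(Chisq(n-1), 1-alpha)` is the exact version; the test function and its inputs below are illustrative):

```python
from statistics import NormalDist

def chi2_quantile_wh(p, k):
    """Wilson-Hilferty approximation of the p-quantile of chi^2(k)."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * (2 / (9 * k)) ** 0.5) ** 3

def variance_test(xs, sigma0, alpha=0.05):
    """Right-tailed test of H0: sigma <= sigma0. Returns (psi, threshold)."""
    n = len(xs)
    xbar = sum(xs) / n
    psi = sum((x - xbar) ** 2 for x in xs) / sigma0 ** 2
    return psi, chi2_quantile_wh(1 - alpha, n - 1)

# Check of the approximation: 0.95-quantile of chi^2(19)
q95 = chi2_quantile_wh(0.95, 19)
print(round(q95, 2))  # close to the exact value 30.14
```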

Proof.

Under \(P_{\mu,\sigma}\), the random variable \(Z = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2 \sim \chi^2(n-1)\).

\(\psi(X) = \frac{1}{\sigma_0^2}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{\sigma^2}{\sigma_0^2}\, Z.\)

Hence, \(P_{\mu,\sigma}(\psi(X) > t) = P\!\left(Z > \frac{\sigma_0^2}{\sigma^2}\, t\right) \leq P(Z > t) = P(\chi^2(n-1) > t)\), since \(\sigma \leq \sigma_0\) implies \(\frac{\sigma_0^2}{\sigma^2} \geq 1\). \(\blacksquare\)

Illustration

Left-tailed Fisher

\(H_0\): \(\sigma \geq \sigma_0\), \(H_1\): \(\sigma < \sigma_0\)

\(\psi(X) = \frac{1}{\sigma_0^2}\sum_{i=1}^n (X_i - \overline X)^2\)

\(T(X) = \mathbf{1}\{\psi(X) < q_{\alpha}\}\)

\(q_{\alpha}\): quantile(Chisq(n-1), alpha)

pvalue: cdf(Chisq(n-1), xobs)

Two Gaussian Populations

Testing Means, Known Variances

We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\).

\(\sigma_1\), \(\sigma_2\) are known, \(\mu_1\), \(\mu_2\) are unknown

Test problem:

\(H_0: \mu_1 = \mu_2 ~~~\text{VS} ~~~H_1: \mu_1 \neq \mu_2\)

Warning

We can’t use \(\mu_1\) or \(\mu_2\) because they are unknown

Idea

We want to use \(\overline X - \overline Y\) since \(\mathbb E[\overline X] - \mathbb E[\overline Y] = \mu_1 - \mu_2\).

But what is \(\mathbb V(\overline X - \overline Y)\) under \(H_0\)?

\(\mathbb V(\overline X - \overline Y) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\), hence standard deviation \(\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\)

We can use \(\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\) because \(\sigma_1\), \(\sigma_2\) are known here.

Test Statistic

Test Statistic:

\[ \psi(X,Y)=\frac{\overline X - \overline Y}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \]

Property

Under \(H_0\), \(\psi(X,Y)\) follows a \(\mathcal N(0, 1)\)

Test

Two-tailed test:

\[ T(X,Y)=\mathbf 1\left\{|\psi(X, Y)| \geq t_{1-\alpha/2}\right\} \; , \]

\(t_{1-\alpha/2}\) is the \((1-\alpha/2)\)-quantile of the standard Gaussian distribution

We can also test \(\mu_1 < \mu_2\) or \(\mu_1 > \mu_2\).

For that, remove absolute value and take \(t_\alpha\) or \(t_{1-\alpha}\).

Example

Objective. Test if a new medication is efficient to lower cholesterol level

Experiment.

  • Group A: \(n_A = 45\) patients receiving the new medication
  • Group B: \(n_B = 50\) patients receiving a placebo
  • We observe \((X_1, \dots, X_{n_A})\) iid \(\mathcal N(\mu_A,\sigma^2)\) and \((Y_1, \dots, Y_{n_B})\) iid \(\mathcal N(\mu_B,\sigma^2)\), the cholesterol levels. \(\sigma = 8\) mg/dL is known from calibration.

Example (Following)

Test Problem.

  • \(H_0: \mu_A = \mu_B\) VS \(H_1: \mu_A < \mu_B\)

  • Test Statistic. \(\psi(X,Y)=\frac{\overline X - \overline Y}{\sqrt{\frac{\sigma^2}{n_A} + \frac{\sigma^2}{n_B}}}\)

  • Distribution under \(H_0\): \(\psi(X,Y) \sim \mathcal N(0,1)\)

  • Data. \(\overline X = 24.5\) mg/dL and \(\overline Y = 21.3\) mg/dL. Hence \(\psi(X,Y)= \frac{3.2}{\sqrt{64/45 + 64/50}} \approx 1.95\).

  • p-value. \(\mathbb P(\psi(X, Y) \leq 1.95) \approx 0.97\) (\(P\) well defined under \(H_0\)!)

  • Conclusion. Do not reject, and do not use this medication!
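As a sanity check, the statistic and p-value can be recomputed from the stated values (a Python sketch, taking \(\sigma = 8\) mg/dL as given):

```python
from math import sqrt
from statistics import NormalDist

nA, nB, sigma = 45, 50, 8.0
xbar, ybar = 24.5, 21.3

# Two-sample z statistic with known, equal variances
psi = (xbar - ybar) / sqrt(sigma**2 / nA + sigma**2 / nB)
p_value = NormalDist().cdf(psi)  # left-tailed: P(N(0,1) <= psi)

print(round(psi, 2), round(p_value, 3))
# psi is large and positive: nowhere near the left rejection region
```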

Testing Variances, Unknown Means

We observe \(X=(X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \(Y=(Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\). We also assume that \(X\) and \(Y\) are independent.

\(\sigma_1\), \(\sigma_2\), \(\mu_1\), \(\mu_2\) are unknown here

Testing Problem:

\(H_0: \sigma_1 = \sigma_2 ~~~~ \text{ or } ~~~~ H_1: \sigma_1 \neq \sigma_2\)

Idea

\(H_0: \sigma_1 = \sigma_2 ~~~~ \text{ or } ~~~~ H_1: \sigma_1 \neq \sigma_2\)

We can’t use \(\sigma_1\), \(\sigma_2\) directly because they are unknown.

We estimate them:

\(\hat \sigma^2_1 = \tfrac{1}{n_1-1}\sum_{i=1}^{n_1}(X_i-\overline X)^2\) and \(\hat \sigma^2_2 = \tfrac{1}{n_2-1}\sum_{i=1}^{n_2}(Y_i-\overline Y)^2\)

These are unbiased estimators since \(\mathbb E[\hat \sigma_1^2]= \sigma_1^2\) and \(\mathbb E[\hat \sigma_2^2]= \sigma_2^2\)

F-Test Statistic

F-Test Statistic

The F-test statistic for comparing the variances is \[ \psi(X,Y)=\frac{\hat \sigma^2_1}{\hat \sigma_2^2} = \frac{\tfrac{1}{n_1-1}\sum_{i=1}^{n_1}(X_i-\overline X)^2}{\tfrac{1}{n_2-1}\sum_{i=1}^{n_2}(Y_i-\overline Y)^2}\; . \]

Distribution

Distribution of F-Test Statistic

Under the distribution given by parameters \(\mu_1, \mu_2, \sigma_1, \sigma_2\), \(\psi(X,Y)=\frac{\hat \sigma^2_1}{\hat \sigma_2^2}\) has distribution \(\frac{\sigma^2_1}{\sigma_2^2} \mathcal F(n_1-1, n_2-1)\)

This distribution is unknown under \(H_1\), but under \(H_0\), \(\sigma_1=\sigma_2\) so it is just \(\mathcal F(n_1-1, n_2-1)\)

Proof

Since \(X_1,\dots,X_{n_1}\overset{iid}{\sim}\mathcal{N}(\mu_1,\sigma_1^2)\), we have \((n_1-1)\hat\sigma_1^2/\sigma_1^2\sim\chi^2(n_1-1)\), and similarly \((n_2-1)\hat\sigma_2^2/\sigma_2^2\sim\chi^2(n_2-1)\), independently. Then

\[\psi(X,Y)=\frac{\hat\sigma_1^2}{\hat\sigma_2^2}=\frac{\sigma_1^2}{\sigma_2^2}\cdot\frac{\hat\sigma_1^2/\sigma_1^2}{\hat\sigma_2^2/\sigma_2^2} = \frac{\sigma_1^2}{\sigma_2^2}\cdot\frac{\chi^2(n_1-1)/(n_1-1)}{\chi^2(n_2-1)/(n_2-1)}\sim\frac{\sigma_1^2}{\sigma_2^2}\,\mathcal{F}(n_1-1,n_2-1),\]

by definition of the F-distribution. \(\blacksquare\)
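The proposition can be checked by simulation (a Python sketch, not from the notes): under \(H_0\) with \(n_1-1 = 9\) and \(n_2-1 = 11\), \(\psi(X,Y) \sim \mathcal F(9,11)\), whose mean is \(11/9 \approx 1.22\).

```python
import random

def var_hat(xs):
    """Unbiased sample variance."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

random.seed(1)
n1, n2, mu1, mu2, sigma = 10, 12, 3.0, -1.0, 2.0  # sigma1 = sigma2: H0 holds
reps = 20000
psis = [
    var_hat([random.gauss(mu1, sigma) for _ in range(n1)])
    / var_hat([random.gauss(mu2, sigma) for _ in range(n2)])
    for _ in range(reps)
]
mean_psi = sum(psis) / reps
print(round(mean_psi, 2))  # close to E[F(9, 11)] = 11/9 ~ 1.22
```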

Testing Means, Equal Variances

We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\).

\(\sigma_1\), \(\sigma_2\), \(\mu_1\), \(\mu_2\) are unknown, but we know that \(\sigma_1=\sigma_2\)

Equality of mean testing problem:

\[ H_0: \mu_1 = \mu_2 ~~~~ \text{ or } ~~~~ H_1: \mu_1 \neq \mu_2 \]

Formally, \(\Theta_0 = \{(\mu,\sigma, \mu, \sigma): \mu \in \mathbb R, \sigma > 0\}\).

Idea for Testing Means

Use again \(\overline X - \overline Y\) (expectation \(\mu_1 - \mu_2\))

What is its (unknown) variance?

\(\sigma_1 = \sigma_2 = \sigma\) so we have

\(\mathbb V(\overline X - \overline Y) = \sigma^2(\frac{1}{n_1} + \frac{1}{n_2})\)

Warning

We can’t use \(\sigma\) to normalize because it is unknown !!!

We have to estimate it: \(\hat \sigma^2 = \frac{1}{n_1 + n_2 - 2}\left(\sum_{i=1}^{n_1}(X_i - \overline X)^2 + \sum_{i=1}^{n_2}(Y_i - \overline Y)^2 \right)\)

Test for Equality of Means (Equal Variance)

Student T-Test for two populations with equal variance

  • \(\hat \sigma^2 = \frac{1}{n_1 + n_2 - 2}\left(\sum_{i=1}^{n_1}(X_i - \overline X)^2 + \sum_{i=1}^{n_2}(Y_i - \overline Y)^2 \right)\)

  • Normalize \(\overline X - \overline Y\): under \(H_0\), \[\psi(X,Y) = \frac{\overline X - \overline Y}{\sqrt{\hat \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim \mathcal T(n_1+n_2 - 2) \; .\]

  • \(\psi(X,Y)\) is pivotal because \(\sigma_1 = \sigma_2\).
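A small worked computation of the pooled statistic (a Python sketch; the sample values are made up for illustration):

```python
from math import sqrt

xs = [5.1, 4.9, 5.3, 5.0, 5.2]   # sample 1 (hypothetical data)
ys = [4.8, 5.0, 4.7, 4.9]        # sample 2 (hypothetical data)
n1, n2 = len(xs), len(ys)

xbar, ybar = sum(xs) / n1, sum(ys) / n2
# Pooled variance estimator with n1 + n2 - 2 degrees of freedom
pooled = (sum((x - xbar) ** 2 for x in xs)
          + sum((y - ybar) ** 2 for y in ys)) / (n1 + n2 - 2)

psi = (xbar - ybar) / sqrt(pooled * (1 / n1 + 1 / n2))
print(round(psi, 2))  # compare to quantiles of T(n1 + n2 - 2) = T(7)
```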

Equality Means, Unequal Variances

We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\)

\(\sigma_1\), \(\sigma_2\), \(\mu_1\), \(\mu_2\) are unknown

Equality of Mean Testing Problem:

\[ H_0: \mu_1 = \mu_2 ~~~~ \text{ or } ~~~~ H_1: \mu_1 \neq \mu_2 \]

Formally:

\(\Theta_0 = \{(\mu,\sigma_1, \mu, \sigma_2), \mu \in \mathbb R, \sigma_1, \sigma_2 > 0\}\).

Student Welch Test

Student Welch test statistic

\[\psi(X, Y) = \frac{\overline X - \overline Y}{\sqrt{\frac{\hat \sigma_1^2}{n_1} + \frac{\hat \sigma_2^2}{n_2}}}\]

  • \(\psi(X,Y)\) is not pivotal
  • Gaussian approximation: \(\psi(X,Y) \approx \mathcal N(0,1)\) when \(n_1, n_2 \to \infty\)
  • Better approximation: Student Welch
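For completeness (this formula is stated here without derivation), the Welch approximation replaces the law of \(\psi(X,Y)\) under \(H_0\) by a Student distribution \(\mathcal T(\hat\nu)\) with estimated degrees of freedom given by the Welch–Satterthwaite formula:

\[ \hat\nu = \frac{\left(\frac{\hat\sigma_1^2}{n_1} + \frac{\hat\sigma_2^2}{n_2}\right)^2}{\frac{(\hat\sigma_1^2/n_1)^2}{n_1-1} + \frac{(\hat\sigma_2^2/n_2)^2}{n_2-1}} \; , \]

which always satisfies \(\min(n_1, n_2) - 1 \leq \hat\nu \leq n_1 + n_2 - 2\).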

Asymptotic Approximations

General Principle

We observe \((X_1, \dots, X_{n_1})\) and/or \((Y_1, \dots, Y_{n_2})\) and assume that the observations are independent

\(\mathbb E[X_i] = \mu_1\), \(\mathbb E[Y_i] = \mu_2\), variance \(\sigma_1^2\) and \(\sigma_2^2\).

Even if the \(X_i\) are not Gaussian, we can approximate e.g. \(\sqrt{\tfrac{n_1}{\sigma_1^2}}(\overline X - \mu_1)\) by a \(\mathcal N(0,1)\) using the CLT.

Insight: centered and normalized averages look approximately Gaussian under an independence assumption.

Hence, we can compute approximate p-value/rejection regions.
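A simulation sketch of this principle (Python; the exponential distribution is chosen as an example of clearly non-Gaussian data):

```python
import random
from math import sqrt

random.seed(2)
n, reps, mu, sigma = 200, 20000, 1.0, 1.0  # Exp(1): mean 1, variance 1
hits = 0
for _ in range(reps):
    xs = [random.expovariate(1.0) for _ in range(n)]
    xbar = sum(xs) / n
    z = sqrt(n / sigma**2) * (xbar - mu)   # centered and normalized mean
    if z > 1.645:                          # 0.95-quantile of N(0,1)
        hits += 1

rate = hits / reps
print(rate)  # close to 0.05 by the CLT
```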

Proportion Test

We observe \(X \sim Bin(n_1, p_1)\) and \(Y \sim Bin(n_2, p_2)\).

Warning

Here, X is not a vector, but an integer!!

\(n_1\), \(n_2\) are known but \(p_1\), \(p_2\) are unknown in \((0,1)\)

\(H_0\): \(p_1 = p_2\) or \(H_1\): \(p_1 \neq p_2\)

Idea: use \(X/n_1 - Y/n_2\), because \(\mathbb E[X/n_1 - Y/n_2] = p_1 - p_2\). What is its variance?

notation: \(X/n_1\) is an estimator of \(p_1\) so we write \(\hat p_1 = X/n_1\).

Test Statistic

Test Statistic

\[ \psi(X,Y) = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p ( 1-\hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \; .\]

  • \(\hat p_1 = X/n_1\), \(\hat p_2 = Y/n_2\)
  • \(\hat p = \frac{X+Y}{n_1+n_2}\)
  • If \(n_1 p_1, n_2 p_2 \gg 1\): under \(H_0\), \(\psi(X,Y) \approx \mathcal N(0,1)\)
  • We reject if \(|\psi(X,Y)| \geq t_{1-\alpha/2}\) (gaussian quantile)

Example

Poll: “should we raise taxes on cigarettes to pay for a healthcare reform?”

Question: Are non-smokers on average more in favor of the tax raise?

Observations:

      Non-Smokers  Smokers  Total
YES           351       41    392
NO            254      154    408
Total         605      195    800

Description

Data description: we observe \(X\) and \(Y\) the number of non-smokers (resp. smokers) willing to raise tax, among a population of \(n_1\) non-smokers (resp. \(n_2\) smokers).

Alternative description: we observe \((X_1, \dots, X_{n_1})\) and \((Y_1, \dots, Y_{n_2})\) where \(X_i\) (resp. \(Y_i\)) is equal to \(1\) if and only if non-smoker \(i\) (resp. smoker \(i\)) wants a tax raise.

Formulation of the problem

Assumption: We assume independency and that \(X \sim \mathcal B(n_1, p_1)\) and \(Y \sim \mathcal B(n_2, p_2)\) for unknown probabilities \(p_1\), \(p_2\)

(Or in the alternative description): the \(X_i\), \(Y_i\) are independent Bernoulli variables with parameters \(p_1\), \(p_2\). We denote \(X = \sum_{i=1}^{n_1} X_i\) and \(Y= \sum_{i=1}^{n_2} Y_i\).

Problem: We want to test

\(H_0: p_1=p_2\) VS \(H_1: p_1 > p_2\)

Resolution

\(p_1\), \(p_2\): proportion of non-smokers or smokers willing to raise taxes

\(H_0\): \(p_1=p_2\) or \(H_1\): \(p_1 > p_2\)

  • \(\hat p_1 = \overline X \approx 0.58\), \(\hat p_2=\overline Y \approx 0.21\).
  • \(\psi(X,Y)= \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p ( 1-\hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \approx 8.99\)
  • \(\mathbb P(\psi(X,Y) > 8.99)\) = 1-cdf(Normal(0,1), 8.99) \(\approx 0\): reject \(H_0\)
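The computation can be checked with a short Python script (taking \(n_2 = 195\) smokers, consistent with \(\hat p_2 = 41/n_2 \approx 0.21\) and a total of 800 respondents):

```python
from math import sqrt
from statistics import NormalDist

# Counts from the poll: X of n1 non-smokers and Y of n2 smokers say YES
n1, X = 605, 351
n2, Y = 195, 41

p1, p2 = X / n1, Y / n2
p = (X + Y) / (n1 + n2)                    # pooled proportion under H0
psi = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
p_value = 1 - NormalDist().cdf(psi)        # right-tailed p-value

print(round(psi, 2), p_value)  # psi ~ 8.99, p-value essentially 0
```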
