Gaussian Populations

Outline

  • One Gaussian population (one mean, one variance):
    • Testing mean with known variance
    • Testing mean with unknown variance
    • Testing variance with unknown mean
  • Two Gaussian populations (two means, two variances)
    • Testing difference of means with known variances
    • Testing difference of means with known variances

Previous: usual distributions

Next: goodness of fit

One Gaussian Population

Testing Mean with Known Variance

\(X = (X_1, \dots, X_n)\), iid with distribution \(\mathcal N(\mu, \sigma^2)\).

Test Problems

\(H_0: \mu = \mu_0 ~~~~ \text{ or } ~~~ H_1: \mu > \mu_0\) (right-tailed)

\(H_0: \mu = \mu_0 ~~~ \text{ or } ~~~ H_1: \mu < \mu_0\) (left-tailed)

\(H_0: \mu = \mu_0 ~~~ \text{ or } ~~~ H_1: \mu \neq \mu_0\) (two-tailed)

  • We want to test the mean \(\mu = \mu_0\). Natural idea: use \(\overline X\)
  • But \(\overline X \sim \mathcal N(\mu_0, \frac{\sigma^2}{n})\) under \(H_0\). So we normalize to get an \(\mathcal N(0,1)\)

Test Statistic

Test statistic:

\[\psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma}\]

Under \(H_0\), \(\psi(X) \sim \mathcal N(0,1)\)

Warning

This is a test statistic because \(\mu_0\) and \(\sigma\) are known here. It wouldn’t be otherwise.

Tests

Test Critical Regions

\(\mathcal R\): \(\frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} > t_{1-\alpha}\) (right-tailed)

\(\mathcal R\): \(\frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} < t_{\alpha}\) (left-tailed)

\(\mathcal R\): \(\left|\frac{\sqrt{n}(\overline X-\mu_0)}{\sigma}\right| > t_{1-\tfrac{\alpha}{2}}\) (two-tailed)

Illustration

Example

A machine fills bottles with a nominal volume \(\mu_0 = 500\) ml. The filling volume is known to follow \(\mathcal{N}(\mu, \sigma^2)\) with \(\sigma = 5\) ml. On a sample of \(n = 25\) bottles, we observe \(\overline{x} = 498.1\) ml. Is the machine under-filling?

  • \(H_0: \mu = 500\) vs \(H_1: \mu < 500\) (left-tailed)
  • Test statistic: \(\tfrac{\sqrt{n}(\overline X - \mu_0)}{\sigma} = \tfrac{\sqrt{25}(498.1 - 500)}{5} = -1.9\)
  • Threshold: \(t_{\alpha} =\) quantile(Normal(0,1), 0.05) = -1.645
  • \(-1.9 < -1.645\): reject \(H_0\) at level \(5\%\)
  • p-value: cdf(Normal(0,1), -1.9) = 0.029
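These numbers can be reproduced with a short script. The notes use Julia-style `quantile`/`cdf` calls; here is a Python sketch where `statistics.NormalDist` plays the same role:

```python
from math import sqrt
from statistics import NormalDist

# Known-variance z-test for the bottle-filling example
n, mu0, sigma, xbar, alpha = 25, 500.0, 5.0, 498.1, 0.05

psi = sqrt(n) * (xbar - mu0) / sigma      # observed test statistic
t_alpha = NormalDist().inv_cdf(alpha)     # left-tailed threshold t_alpha
p_value = NormalDist().cdf(psi)           # P(N(0,1) <= psi)

print(round(psi, 3), round(t_alpha, 3), round(p_value, 3))
# psi = -1.9 < -1.645, so H0 is rejected at level 5%
```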

Testing Mean with Unknown Variance

We observe \((X_1, \dots, X_n)\) iid \(\mathcal N(\mu, \sigma^2)\) where \(\mu\) and \(\sigma\) are unknown.

We fix \(\mu_0\) as a known quantity and we want to test whether \(\mu = \mu_0\).

Composite VS composite test problem:

\[ H_0: \{\mu = \mu_0,\ \sigma > 0\} \text{ or } H_1: \{\mu \neq \mu_0,\ \sigma > 0\} \;. \]

Warning

\(\psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma}\) is no longer a test statistic, because \(\sigma\) is unknown.

Idea: replace \(\sigma\) by its estimator \[ \hat \sigma(X) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i - \overline X)^2} \; .\]

Student T-Test

\[ H_0: \{\mu = \mu_0,\ \sigma > 0\} \text{ or } H_1: \{\mu \neq \mu_0,\ \sigma > 0\} \;. \]

(Student) T-test statistic: \[T(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\hat \sigma(X)}\]

Proposition: distribution of T under \(H_0\)

\(T(X)\) is a pivotal test statistic.

Under \(H_0\), \(T(X)\sim \mathcal T(n-1)\)

Proof

Take \(E = \operatorname{Span}(\mathbf{1})\) and \(F = E^\perp\). Setting \(Y_i = \frac{X_i - \mu}{\sigma}\):

\[ \|\Pi_E Y\|^2 = n\overline{Y}^2 = \frac{n(\overline{X}-\mu)^2}{\sigma^2} \sim \chi^2(1) \]

\[ \|\Pi_F Y\|^2 = \sum_{i=1}^n (Y_i - \overline{Y})^2 = \frac{(n-1)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-1) \]

and the two are independent, which is precisely what is needed for the Student \(T\)-statistic.
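A quick simulation (a Python sketch, not part of the notes) illustrates pivotality: under \(H_0\) the distribution of \(T(X)\) does not depend on \((\mu_0, \sigma)\), so the rejection rate at the t-quantile \(t_{0.95, 9} \approx 1.833\) stays near \(5\%\) for any parameter values:

```python
import random
from math import sqrt

def t_stat(xs, mu0):
    """Student T statistic sqrt(n)(xbar - mu0)/sigma_hat."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return sqrt(n) * (xbar - mu0) / sqrt(s2)

def rejection_rate(mu0, sigma, n=10, reps=20000, q=1.833):
    # q ~ 0.95-quantile of T(9), taken from a t table
    rej = sum(
        1
        for _ in range(reps)
        if t_stat([random.gauss(mu0, sigma) for _ in range(n)], mu0) > q
    )
    return rej / reps

random.seed(0)
r1 = rejection_rate(mu0=0.0, sigma=1.0)    # standard parameters
r2 = rejection_rate(mu0=37.0, sigma=12.0)  # very different parameters
print(r1, r2)  # both close to 0.05: T is pivotal
```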

Student Distribution

Testing Variance, Unknown Mean

We observe \(X=(X_1, \dots, X_{n})\) iid \(\mathcal N(\mu, \sigma^2)\). \(\mu\) and \(\sigma\) are unknown. \(\sigma_0\) is fixed and known.

We want to test if \(\sigma > \sigma_0\), or \(\sigma < \sigma_0\)

Right-Tailed Fisher

\(H_0\): \(\sigma \leq \sigma_0\), \(H_1\): \(\sigma > \sigma_0\)

Test statistic:

\(\psi(X) = \frac{1}{\sigma_0^2}\sum_{i=1}^n (X_i - \overline X)^2\)

Test:

\(T(X) = \mathbf{1}\{\psi(X) > q_{1-\alpha}\}\) with \(q_{1-\alpha}\) the \((1-\alpha)\)-quantile of \(\chi^2(n-1)\)

Rejection Region: \([q_{1-\alpha}, +\infty)\)

Critical Region: \(\{(x_1, \dots, x_n) \in \mathbb R^n: ~ \psi(x_1, \dots, x_n) > q_{1-\alpha}\}\)

Property of Fisher Test

Proposition

Fix \(t>0\). Under \(H_0\), that is if \(\sigma \leq \sigma_0\)

\[P_{\mu, \sigma}(\psi(X) > t) \leq P_{\mu, \sigma_0}(\psi(X) > t) = P(\chi^2(n-1) > t)\]


In practice:

\(q_{1-\alpha}\): quantile(Chisq(n-1), 1-alpha)

pvalue: 1-cdf(Chisq(n-1), xobs)
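Without a statistics library, the \(\chi^2\) quantile can be approximated by the Wilson–Hilferty formula. A Python sketch (the notes' `quantile(Chisq(n-1), 1-alpha)` is the exact version; the test function and its inputs below are illustrative):

```python
from statistics import NormalDist

def chi2_quantile_wh(p, k):
    """Wilson-Hilferty approximation of the p-quantile of chi^2(k)."""
    z = NormalDist().inv_cdf(p)
    return k * (1 - 2 / (9 * k) + z * (2 / (9 * k)) ** 0.5) ** 3

def variance_test(xs, sigma0, alpha=0.05):
    """Right-tailed test of H0: sigma <= sigma0. Returns (psi, threshold)."""
    n = len(xs)
    xbar = sum(xs) / n
    psi = sum((x - xbar) ** 2 for x in xs) / sigma0 ** 2
    return psi, chi2_quantile_wh(1 - alpha, n - 1)

# Check of the approximation: 0.95-quantile of chi^2(19)
q95 = chi2_quantile_wh(0.95, 19)
print(round(q95, 2))  # close to the exact value 30.14
```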

Proof.

Under \(P_{\mu,\sigma}\), the random variable \(Z = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2 \sim \chi^2(n-1)\).

\(\psi(X) = \frac{1}{\sigma_0^2}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{\sigma^2}{\sigma_0^2}\, Z.\)

Hence, \(P_{\mu,\sigma}(\psi(X) > t) = P\!\left(Z > \frac{\sigma_0^2}{\sigma^2}\, t\right) \leq P(Z > t) = P(\chi^2(n-1) > t)\), since \(\sigma \leq \sigma_0\) implies \(\frac{\sigma_0^2}{\sigma^2} \geq 1\). \(\blacksquare\)

Illustration

Left-tailed Fisher

\(H_0\): \(\sigma \geq \sigma_0\), \(H_1\): \(\sigma < \sigma_0\)

\(\psi(X) = \frac{1}{\sigma_0^2}\sum_{i=1}^n (X_i - \overline X)^2\)

\(T(X) = \mathbf{1}\{\psi(X) < q_{\alpha}\}\)

\(q_{\alpha}\): quantile(Chisq(n-1), alpha)

pvalue: cdf(Chisq(n-1), xobs)

Two Gaussian Populations

Testing Means, Known Variances

We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\).

\(\sigma_1\), \(\sigma_2\) are known, \(\mu_1\), \(\mu_2\) are unknown

Test problem:

\(H_0: \mu_1 = \mu_2 ~~~\text{VS} ~~~H_1: \mu_1 \neq \mu_2\)

Warning

We can’t use \(\mu_1\) or \(\mu_2\) because they are unknown

Idea

We want to use \(\overline X - \overline Y\) since \(\mathbb E[\overline X] - \mathbb E[\overline Y] = \mu_1 - \mu_2\).

But what is \(\mathbb V(\overline X - \overline Y)\) under \(H_0\)?

\(\mathbb V(\overline X - \overline Y) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\), hence standard deviation \(\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\)

We can use \(\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\) because \(\sigma_1\), \(\sigma_2\) are known here.

Test Statistic

Test Statistic:

\[ \psi(X,Y)=\frac{\overline X - \overline Y}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \]

Property

Under \(H_0\), \(\psi(X,Y)\) follows a \(\mathcal N(0, 1)\)

Test

Two-tailed test:

\[ T(X,Y)=\mathbf 1\left\{|\psi(X, Y)| \geq t_{1-\alpha/2}\right\} \; , \]

\(t_{1-\alpha/2}\) is the \((1-\alpha/2)\)-quantile of the standard Gaussian distribution

We can also test \(\mu_1 < \mu_2\) or \(\mu_1 > \mu_2\).

For that, remove absolute value and take \(t_\alpha\) or \(t_{1-\alpha}\).

Example

Objective. Test if a new medication is efficient to lower cholesterol level

Experiment.

  • Group A: \(n_A = 45\) patients receiving the new medication
  • Group B: \(n_B = 50\) patients receiving a placebo
  • We observe \((X_1, \dots, X_{n_A})\) iid \(\mathcal N(\mu_A,\sigma^2)\) and \((Y_1, \dots, Y_{n_B})\) iid \(\mathcal N(\mu_B,\sigma^2)\), the cholesterol levels. \(\sigma = 8\) mg/dL is known from calibration.

Example (Following)

Test Problem.

  • \(H_0: \mu_A = \mu_B\) VS \(H_1: \mu_A < \mu_B\)

  • Test Statistic. \(\psi(X,Y)=\frac{\overline X - \overline Y}{\sqrt{\frac{\sigma^2}{n_A} + \frac{\sigma^2}{n_B}}}\)

  • Distribution under \(H_0\): \(\psi(X,Y) \sim \mathcal N(0,1)\)

  • Data. \(\overline X = 24.5\) mg/dL and \(\overline Y = 21.3\) mg/dL. Hence \(\psi(X,Y)= \frac{3.2}{\sqrt{64/45 + 64/50}} \approx 1.95\).

  • p-value. \(\mathbb P(\psi(X, Y) \leq 1.95) \approx 0.97\) (\(P\) well defined under \(H_0\)!)

  • Conclusion. Do not reject, and do not use this medication!
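As a sanity check, the statistic and p-value can be recomputed from the stated values (a Python sketch, taking \(\sigma = 8\) mg/dL as given):

```python
from math import sqrt
from statistics import NormalDist

nA, nB, sigma = 45, 50, 8.0
xbar, ybar = 24.5, 21.3

# Two-sample z statistic with known, equal variances
psi = (xbar - ybar) / sqrt(sigma**2 / nA + sigma**2 / nB)
p_value = NormalDist().cdf(psi)  # left-tailed: P(N(0,1) <= psi)

print(round(psi, 2), round(p_value, 3))
# psi is large and positive: nowhere near the left rejection region
```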

Testing Variances, Unknown Means

We observe \(X=(X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \(Y=(Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\). We also assume that \(X\) and \(Y\) are independent.

\(\sigma_1\), \(\sigma_2\), \(\mu_1\), \(\mu_2\) are unknown here

Testing Problem:

\(H_0: \sigma_1 = \sigma_2 ~~~~ \text{ or } ~~~~ H_1: \sigma_1 \neq \sigma_2\)

Idea

\(H_0: \sigma_1 = \sigma_2 ~~~~ \text{ or } ~~~~ H_1: \sigma_1 \neq \sigma_2\)

We can’t use \(\sigma_1\), \(\sigma_2\) directly because they are unknown.

We estimate them:

\(\hat \sigma^2_1 = \tfrac{1}{n_1-1}\sum_{i=1}^{n_1}(X_i-\overline X)^2\) and \(\hat \sigma^2_2 = \tfrac{1}{n_2-1}\sum_{i=1}^{n_2}(Y_i-\overline Y)^2\)

These are unbiased estimators since \(\mathbb E[\hat \sigma_1^2]= \sigma_1^2\) and \(\mathbb E[\hat \sigma_2^2]= \sigma_2^2\)

F-Test Statistic

F-Test Statistic

The F-test statistic for comparing the variances is \[ \psi(X,Y)=\frac{\hat \sigma^2_1}{\hat \sigma_2^2} = \frac{\tfrac{1}{n_1-1}\sum_{i=1}^{n_1}(X_i-\overline X)^2}{\tfrac{1}{n_2-1}\sum_{i=1}^{n_2}(Y_i-\overline Y)^2}\; . \]

Distribution

Distribution of F-Test Statistic

Under the distribution given by parameters \(\mu_1, \mu_2, \sigma_1, \sigma_2\), \(\psi(X,Y)=\frac{\hat \sigma^2_1}{\hat \sigma_2^2}\) has distribution \(\frac{\sigma^2_1}{\sigma_2^2} \mathcal F(n_1-1, n_2-1)\)

This distribution is unknown under \(H_1\), but under \(H_0\), \(\sigma_1=\sigma_2\) so it is just \(\mathcal F(n_1-1, n_2-1)\)

Proof

Since \(X_1,\dots,X_{n_1}\overset{iid}{\sim}\mathcal{N}(\mu_1,\sigma_1^2)\), we have \((n_1-1)\hat\sigma_1^2/\sigma_1^2\sim\chi^2(n_1-1)\), and similarly \((n_2-1)\hat\sigma_2^2/\sigma_2^2\sim\chi^2(n_2-1)\), independently. Then

\[\psi(X,Y)=\frac{\hat\sigma_1^2}{\hat\sigma_2^2}=\frac{\sigma_1^2}{\sigma_2^2}\cdot\frac{\hat\sigma_1^2/\sigma_1^2}{\hat\sigma_2^2/\sigma_2^2} = \frac{\sigma_1^2}{\sigma_2^2}\cdot\frac{\chi^2(n_1-1)/(n_1-1)}{\chi^2(n_2-1)/(n_2-1)}\sim\frac{\sigma_1^2}{\sigma_2^2}\,\mathcal{F}(n_1-1,n_2-1),\]

by definition of the F-distribution. \(\blacksquare\)
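The proposition can be checked by simulation (a Python sketch, not from the notes): under \(H_0\) with \(n_1-1 = 9\) and \(n_2-1 = 11\), \(\psi(X,Y) \sim \mathcal F(9,11)\), whose mean is \(11/9 \approx 1.22\).

```python
import random

def var_hat(xs):
    """Unbiased sample variance."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

random.seed(1)
n1, n2, mu1, mu2, sigma = 10, 12, 3.0, -1.0, 2.0  # sigma1 = sigma2: H0 holds
reps = 20000
psis = [
    var_hat([random.gauss(mu1, sigma) for _ in range(n1)])
    / var_hat([random.gauss(mu2, sigma) for _ in range(n2)])
    for _ in range(reps)
]
mean_psi = sum(psis) / reps
print(round(mean_psi, 2))  # close to E[F(9, 11)] = 11/9 ~ 1.22
```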

Testing Means, Equal Variances

We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\).

\(\sigma_1\), \(\sigma_2\), \(\mu_1\), \(\mu_2\) are unknown, but we know that \(\sigma_1=\sigma_2\)

Equality of mean testing problem:

\[ H_0: \mu_1 = \mu_2 ~~~~ \text{ or } ~~~~ H_1: \mu_1 \neq \mu_2 \]

Formally, \(\Theta_0 = \{(\mu,\sigma, \mu, \sigma): \mu \in \mathbb R, \sigma > 0\}\).

Idea for Testing Means

Use again \(\overline X - \overline Y\) (expectation \(\mu_1 - \mu_2\))

What is its (unknown) variance?

\(\sigma_1 = \sigma_2 = \sigma\) so we have

\(\mathbb V(\overline X - \overline Y) = \sigma^2(\frac{1}{n_1} + \frac{1}{n_2})\)

Warning

We can’t use \(\sigma\) to normalize because it is unknown !!!

We have to estimate it: \(\hat \sigma^2 = \frac{1}{n_1 + n_2 - 2}\left(\sum_{i=1}^{n_1}(X_i - \overline X)^2 + \sum_{i=1}^{n_2}(Y_i - \overline Y)^2 \right)\)

Test for Equality of Means (Equal Variance)

Student T-Test for two populations with equal variance

  • \(\hat \sigma^2 = \frac{1}{n_1 + n_2 - 2}\left(\sum_{i=1}^{n_1}(X_i - \overline X)^2 + \sum_{i=1}^{n_2}(Y_i - \overline Y)^2 \right)\)

  • Normalize \(\overline X - \overline Y\): under \(H_0\), \[\psi(X,Y) = \frac{\overline X - \overline Y}{\sqrt{\hat \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim \mathcal T(n_1+n_2 - 2) \; .\]

  • \(\psi(X,Y)\) is pivotal because \(\sigma_1 = \sigma_2\).
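A small worked computation of the pooled statistic (a Python sketch; the sample values are made up for illustration):

```python
from math import sqrt

xs = [5.1, 4.9, 5.3, 5.0, 5.2]   # sample 1 (hypothetical data)
ys = [4.8, 5.0, 4.7, 4.9]        # sample 2 (hypothetical data)
n1, n2 = len(xs), len(ys)

xbar, ybar = sum(xs) / n1, sum(ys) / n2
# Pooled variance estimator with n1 + n2 - 2 degrees of freedom
pooled = (sum((x - xbar) ** 2 for x in xs)
          + sum((y - ybar) ** 2 for y in ys)) / (n1 + n2 - 2)

psi = (xbar - ybar) / sqrt(pooled * (1 / n1 + 1 / n2))
print(round(psi, 2))  # compare to quantiles of T(n1 + n2 - 2) = T(7)
```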

Equality Means, Unequal Variances

We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\)

\(\sigma_1\), \(\sigma_2\), \(\mu_1\), \(\mu_2\) are unknown

Equality of Mean Testing Problem:

\[ H_0: \mu_1 = \mu_2 ~~~~ \text{ or } ~~~~ H_1: \mu_1 \neq \mu_2 \]

Formally:

\(\Theta_0 = \{(\mu,\sigma_1, \mu, \sigma_2), \mu \in \mathbb R, \sigma_1, \sigma_2 > 0\}\).

Student Welch Test

Student Welch test statistic

\[\psi(X, Y) = \frac{\overline X - \overline Y}{\sqrt{\frac{\hat \sigma_1^2}{n_1} + \frac{\hat \sigma_2^2}{n_2}}}\]

  • \(\psi(X,Y)\) is not pivotal
  • Gaussian approximation: \(\psi(X,Y) \approx \mathcal N(0,1)\) when \(n_1, n_2 \to \infty\)
  • Better approximation: Student Welch
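For completeness (this formula is stated here without derivation), the Welch approximation replaces the law of \(\psi(X,Y)\) under \(H_0\) by a Student distribution \(\mathcal T(\hat\nu)\) with estimated degrees of freedom given by the Welch–Satterthwaite formula:

\[ \hat\nu = \frac{\left(\frac{\hat\sigma_1^2}{n_1} + \frac{\hat\sigma_2^2}{n_2}\right)^2}{\frac{(\hat\sigma_1^2/n_1)^2}{n_1-1} + \frac{(\hat\sigma_2^2/n_2)^2}{n_2-1}} \; , \]

which always satisfies \(\min(n_1, n_2) - 1 \leq \hat\nu \leq n_1 + n_2 - 2\).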

Asymptotic Approximations

General Principle

We observe \((X_1, \dots, X_{n_1})\) and/or \((Y_1, \dots, Y_{n_2})\) and assume that the observations are independent

\(\mathbb E[X_i] = \mu_1\), \(\mathbb E[Y_i] = \mu_2\), variance \(\sigma_1^2\) and \(\sigma_2^2\).

Even if the \(X_i\) are not Gaussian, we can approximate e.g. \(\sqrt{\tfrac{n_1}{\sigma_1^2}}(\overline X - \mu_1)\) by a \(\mathcal N(0,1)\) using the CLT.

Insight: centered and normalized averages look approximately Gaussian under an independence assumption.

Hence, we can compute approximate p-value/rejection regions.
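A simulation sketch of this principle (Python; the exponential distribution is chosen as an example of clearly non-Gaussian data):

```python
import random
from math import sqrt

random.seed(2)
n, reps, mu, sigma = 200, 20000, 1.0, 1.0  # Exp(1): mean 1, variance 1
hits = 0
for _ in range(reps):
    xs = [random.expovariate(1.0) for _ in range(n)]
    xbar = sum(xs) / n
    z = sqrt(n / sigma**2) * (xbar - mu)   # centered and normalized mean
    if z > 1.645:                          # 0.95-quantile of N(0,1)
        hits += 1

rate = hits / reps
print(rate)  # close to 0.05 by the CLT
```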

Proportion Test

We observe \(X \sim Bin(n_1, p_1)\) and \(Y \sim Bin(n_2, p_2)\).

Warning

Here, X is not a vector, but an integer!!

\(n_1\), \(n_2\) are known but \(p_1\), \(p_2\) are unknown in \((0,1)\)

\(H_0\): \(p_1 = p_2\) or \(H_1\): \(p_1 \neq p_2\)

Idea: use \(X/n_1 - Y/n_2\), because \(\mathbb E[X/n_1 - Y/n_2] = p_1 - p_2\). What is its variance?

notation: \(X/n_1\) is an estimator of \(p_1\) so we write \(\hat p_1 = X/n_1\).

Test Statistic

Test Statistic

\[ \psi(X,Y) = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p ( 1-\hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \; .\]

  • \(\hat p_1 = X/n_1\), \(\hat p_2 = Y/n_2\)
  • \(\hat p = \frac{X+Y}{n_1+n_2}\)
  • If \(n_1 p_1, n_2 p_2 \gg 1\): under \(H_0\), \(\psi(X,Y) \approx \mathcal N(0,1)\)
  • We reject if \(|\psi(X,Y)| \geq t_{1-\alpha/2}\) (gaussian quantile)

Example

Poll: “should we raise taxes on cigarettes to pay for a healthcare reform?”

Question: Are non-smokers on average more in favor of the tax raise?

Observations:

      Non-Smokers  Smokers  Total
YES           351       41    392
NO            254      154    408
Total         605      195    800

Description

Data description: we observe \(X\) and \(Y\) the number of non-smokers (resp. smokers) willing to raise tax, among a population of \(n_1\) non-smokers (resp. \(n_2\) smokers).

Alternative description: we observe \((X_1, \dots, X_{n_1})\) and \((Y_1, \dots, Y_{n_2})\) where \(X_i\) (resp. \(Y_i\)) is equal to \(1\) if and only if non-smoker \(i\) (resp. smoker \(i\)) wants a tax raise.

Formulation of the problem

Assumption: We assume independency and that \(X \sim \mathcal B(n_1, p_1)\) and \(Y \sim \mathcal B(n_2, p_2)\) for unknown probabilities \(p_1\), \(p_2\)

(Or in the alternative description): the \(X_i\), \(Y_i\) are independent Bernoulli variables with parameters \(p_1\), \(p_2\). We denote \(X = \sum_{i=1}^{n_1} X_i\) and \(Y= \sum_{i=1}^{n_2} Y_i\).

Problem: We want to test

\(H_0: p_1=p_2\) VS \(H_1: p_1 > p_2\)

Resolution

\(p_1\), \(p_2\): proportion of non-smokers or smokers willing to raise taxes

\(H_0\): \(p_1=p_2\) or \(H_1\): \(p_1 > p_2\)

  • \(\hat p_1 = \overline X \approx 0.58\), \(\hat p_2=\overline Y \approx 0.21\).
  • \(\psi(X,Y)= \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p ( 1-\hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \approx 8.99\)
  • \(\mathbb P(\psi(X,Y) > 8.99)\) = 1-cdf(Normal(0,1), 8.99) \(\approx 0\): reject \(H_0\)
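The computation can be checked with a short Python script (taking \(n_2 = 195\) smokers, consistent with \(\hat p_2 = 41/n_2 \approx 0.21\) and a total of 800 respondents):

```python
from math import sqrt
from statistics import NormalDist

# Counts from the poll: X of n1 non-smokers and Y of n2 smokers say YES
n1, X = 605, 351
n2, Y = 195, 41

p1, p2 = X / n1, Y / n2
p = (X + Y) / (n1 + n2)                    # pooled proportion under H0
psi = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
p_value = 1 - NormalDist().cdf(psi)        # right-tailed p-value

print(round(psi, 2), p_value)  # psi ~ 8.99, p-value essentially 0
```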
