Gaussian Populations

One Gaussian Population

Testing Mean with Known Variance

  • \(X = (X_1, \dots, X_n)\), iid with distribution \(\mathcal N(\mu, \sigma^2)\).

Test problems

\[ \begin{aligned} H_0: \mu = \mu_0 ~~~~ &\text{ or } ~~~ H_1: \mu > \mu_0 ~~~ \text{(right-tailed)}\\ H_0: \mu = \mu_0 ~~~ &\text{ or } ~~~ H_1: \mu < \mu_0 ~~~ \text{(left-tailed)}\\ H_0: \mu = \mu_0 ~~~ &\text{ or } ~~~ H_1: \mu \neq \mu_0 ~~~ \text{(two-tailed)}\\ \end{aligned} \]

  • Test statistic: \[ \psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} \; .\]

  • \(\psi(X) \sim \mathcal N(0,1)\)

  • Q1, Q2

Tests

\[ \begin{aligned} \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} > t_{1-\alpha} ~~~ \text{(right-tailed)}\\ \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma} < t_{\alpha} ~~~ \text{(left-tailed)}\\ \left|\frac{\sqrt{n}(\overline X-\mu_0)}{\sigma}\right| > t_{1-\tfrac{\alpha}{2}}~~~ \text{(two-tailed)}\\ \end{aligned} \]

Why 0.05 and 1.96 ?

Fisher’s Quote

The value for which \(p=0.05\), or 1 in 20, is 1.96 or nearly 2 ; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not.

Testing Mean with Unknown Variance

  • Multiple VS multiple test problem: \[ H_0: \{\mu_0,\sigma > 0\} \text{ or } H_1: \{\mu \neq \mu_0,\sigma > 0\} \;. \]

  • \(\psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\sigma}\) no longer test statistic.

  • Idea: replace \(\sigma\) by its estimator \[ \hat \sigma(X) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i - \mu_0)^2} \; .\]

  • This gives \[ \psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\hat \sigma} \; . \]

  • Is \(\psi(X)\) pivotal under \(H_0\) ? What is its distribution ?

Chi-Square and Student Distributions

Chi-squared distribution \(\chi^2(k)\)

  • Distrib of \(\sum_{i=1}^k Z_i^2\) where the \(Z_i\)’s are iid \(\mathcal N(0,1)\).
  • \(\mathbb E[Z_i^4] - \mathbb E[Z_i^2]=2\)
  • \(k\): degree of freedom
  • \(\chi^2(k) \sim k + \sqrt{2k}\mathcal N(0,1)\) when \(k \to +\infty\)

Student distribution \(\mathcal T(k)\)

  • Distrib of \(\tfrac{Z}{\sqrt{U/k}}\) where \(Z\), \(U\) are independent and follow resp. \(\mathcal N(0,1)\) and a \(\chi^2(k)\)
  • \(k\): degree of freedom
  • \(\mathcal T(k) \sim \mathcal N(0,1)\) when \(k \to +\infty\)

Theorem

Assume \(X_i\) are iid \(\mathcal N(\mu_0, \sigma^2)\).

  • The test statistic \(\psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\hat \sigma}\) pivotal (indep. of \(\sigma\)).
  • It follows a Student Distribution \(\mathcal T(n-1)\).
  • Proof idea: \(\overline X \cdot (1, \dots, 1)\) and \((X_1 - \overline X, \dots, X_n - \overline X)\) are orthogonal in \(\mathbb R^n\).

Student T-Test

  • Multiple VS multiple test problem \(X=(X_1, \dots, X_n)\): \[ H_0: \{\mu_0,\sigma > 0\} \text{ or } H_1: \{\mu \neq \mu_0,\sigma > 0\} \;. \]

  • (Student) T-test statistic: \[\psi(X) = \frac{\sqrt{n}(\overline X-\mu_0)}{\hat \sigma(X)} \sim \mathcal T(n-1)\]

Testing Variance, Unknown Mean

  • We observe \(X=(X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu, \sigma^2)\). \(\mu\), \(\sigma\) are unknown. \(\sigma_0\) is fixed and known.

Note

  • \(H_0\): \(\sigma \leq \sigma_0\), \(H_1\): \(\sigma > \sigma_0\)
  • \(\psi(X) = \frac{1}{\sigma_0^2}\sum_{i=1}^n (X_i - \overline X)^2\) Wooclap
  • \(T(X) = \mathbf{1}\{\psi(X) > q_{1-\alpha}\}\)
  • \(q_{1-\alpha}\): quantile(Chisq(n-1), 1-alpha)
  • pvalue: 1-cdf(Chisq(n-1), xobs)

  • \(H_0\): \(\sigma \geq \sigma_0\), \(H_1\): \(\sigma < \sigma_0\)
  • \(\psi(X) = \frac{1}{\sigma_0^2}\sum_{i=1}^n (X_i - \overline X)^2\)
  • \(T(X) = \mathbf{1}\{\psi(X) < q_{\alpha}\}\)
  • \(q_{\alpha}\): quantile(Chisq(n-1), alpha)
  • pvalue: cdf(Chisq(n-1), xobs)

Two Gaussian Populations

Testing Means, Known Variances

  • We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\).

  • \(\sigma_1\), \(\sigma_2\) are known, \(\mu_1\), \(\mu_2\) are unknown

  • Test problem: \(H_0: \mu_1 = \mu_2 ~~~\text{or} ~~~H_1: \mu_1 \neq \mu_2\)

  • Idea: normalize \(\overline X - \overline Y\): \[ \psi(X,Y)=\frac{\overline X - \overline Y}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \]

  • Two-tailed test for testing means: \[ T(X,Y)=\left|\frac{\overline X - \overline Y}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\right| \geq t_{1-\alpha/2} \; , \]

  • \(t_{1-\alpha/2}\) is the \((1-\alpha/2)\)-quantile of a Gaussian distribution

Example

  • Objective. Test if a new medication is efficient to lower cholesterol level

  • Experiment.

    • Group A: \(n_A = 45\) patients receiving the new medication
    • Group B: \(n_B = 50\) patients receiving a placebo
  • Test Problem.

    • We observe \((X_1, \dots, X_{n_A})\) iid \(\mathcal N(\mu_A,\sigma^2)\) and \((Y_1, \dots, Y_{n_B})\) iid \(N(\mu_B,\sigma^2)\) the chol levels. \(\sigma = 8\) mg/dL is known from calibration.
    • \(H_0: \mu_A = \mu_B\) VS \(H_1: \mu_A < \mu_B\)
  • Test Statistic. \(\psi(X,Y)=\frac{\overline X - \overline Y}{\sqrt{\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}}}\)

  • Data. \(\overline X = 24.5\) mg/dL and \(\overline Y = 21.3\) mg/dL. Hence \(\psi(X,Y)= 5.5\).

  • Conclusion. Do not reject, and do not use this medication!

Testing Variances, Unknown Means

  • We observe \((X_1, \dots, X_{n})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\).
  • \(\sigma_1\), \(\sigma_2\), \(\mu_1\), \(\mu_2\) are unknown
  • Variance Testing Problem: \[ H_0: \sigma_1 = \sigma_2 ~~~~ \text{ or } ~~~~ H_1: \sigma_1 \neq \sigma_2 \]
  • F-Test Statistic of the Variances (ANOVA) \[ \frac{\hat \sigma^2_1}{\hat \sigma_2^2} = \frac{\tfrac{1}{n_1-1}\sum_{i=1}^{n_1}(X_i-\overline X)^2}{\tfrac{1}{n_2-1}\sum_{i=1}^{n_2}(Y_i-\overline Y)^2}\; . \]

Fisher Distribution

Fisher distribution \(\mathcal F(k_1,k_2)\)

  • Distribution of \(\frac{U_1/k_1}{U_2/k_2}\), where \(U_1\), \(U_2\) are indep. and follow \(\chi^2(k_1)\), \(\chi^2(k_2)\). wiki Wooclap
  • \(\mathcal F(k_1,k_2) \approx 1 + \sqrt{\frac{2}{k_1} + \frac{2}{k_2}}\mathcal N\left(0, 1\right)\) when \(k_1,k_2 \to +\infty\)
  • \((k_1, k_2)\): degrees of freedom
  • Example: \(\frac{Z_1^2+Z_2^2}{2Z_3^2} \sim \mathcal F(2,1)\) if \(Z_i \sim \mathcal N(0,1)\)

Proposition

  • \(\psi(X,Y)=\frac{\hat \sigma^2_1}{\hat \sigma_2^2}\) is independent of \(\mu_1\), \(\mu_2\), \(\sigma_1\), \(\sigma_2\). It is pivotal
  • It follow distribution \(\mathcal F(n_1-1, n_2-1)\)

  • Two-tailed test: \[ \frac{\hat \sigma^2_1}{\hat \sigma_2^2} \not \in [t_{\alpha/2}, t_{1-\alpha/2}] ~~~\text{(quantile of Fisher)}\]

Testing Means, Equal Variances

  • We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\).
  • \(\sigma_1\), \(\sigma_2\), \(\mu_1\), \(\mu_2\) are unknown, but we know that \(\sigma_1=\sigma_2\)
  • Equality of mean testing problem: \[ H_0: \mu_1 = \mu_2 ~~~~ \text{ or } ~~~~ H_1: \mu_1 \neq \mu_2 \]
  • Formally, \(H_0 = \{(\mu,\sigma, \mu, \sigma), \mu \in \mathbb R, \sigma > 0\}\).

Student T-Test for two populations with equal variance

  • \(\hat \sigma^2 = \frac{1}{n_1 + n_2 - 2}\left(\sum_{i=1}^{n_1}(X_i - \overline X)^2 + \sum_{i=1}^{n_2}(Y_i - \overline Y)^2 \right)\)

  • Normalize \(\overline X - \overline Y\): \[\psi(X,Y) = \frac{\overline X - \overline Y}{\sqrt{\hat \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim \mathcal T(n_1+n_2 - 2) \; .\]

  • \(\psi(X,Y)\) is pivotal because \(\sigma_1 = \sigma_2\).

Equality Means, Unequal Variances

  • We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\).
  • \(\sigma_1\), \(\sigma_2\), \(\mu_1\), \(\mu_2\) are unknown
  • Equality of Mean Testing Problem: \[ H_0: \mu_1 = \mu_2 ~~~~ \text{ or } ~~~~ H_1: \mu_1 \neq \mu_2 \]
  • Formally, \(H_0 = \{(\mu,\sigma_1, \mu, \sigma_2), \mu \in \mathbb R, \sigma_1, \sigma_2 > 0\}\).

Student Welch test statistic

\[\psi(X, Y) = \frac{\overline X - \overline Y}{\sqrt{\frac{\hat \sigma_1^2}{n_1} + \frac{\hat \sigma_2^2}{n_2}}}\]

  • Wooclap \(\psi(X,Y)\) is not pivotal
  • Gaussian approximation: \(\psi(X,Y) \approx \mathcal N(0,1)\) when \(n_1, n_2 \to \infty\)
  • Better approximation: Student Welch

Asymptotic Approximations

Central Limit Theorem

CLT

  • Let \(S_n = \sum_{i=1}^n\) with \((X_1, \dots, X_n)\) iid (\(L^2\)) then \[ \frac{S_n - \mathbb E[S_n]}{\sqrt{\mathrm{Var}(S_n)}} \approx \mathcal N(0,1) \text{ when $n \to \infty$} \]
  • Equality when \(X_i\)’s are \(\mathcal N(\mu, \sigma^2)\)
  • Rule of thumb: \(n \geq 30\)

Example: binomials

  • If \(p \in (0,1)\)
  • \(\frac{\mathrm{Bin}(n,p) - np}{\sqrt{np(1-p)}} \approx \mathcal N(0,1)\) when \(n \to \infty\)
  • \(n\) should be \(\gg \frac{1}{p}\)

Good Approx for (\(n=100\), \(p=0.2\))

Bad Approx for (\(n=100\), \(p=0.01\))

Proportion Test

  • We observe \(X \sim Bin(n_1, p_1)\) and \(Y \sim Bin(n_2, p_2)\).
  • \(n_1\), \(n_2\) are known but \(p_1\), \(p_2\) are unknown in \((0,1)\)
  • \(H_0\): \(p_1 = p_2\) or \(H_1\): \(p_1 \neq p_2\)

Test Statistic

\[ \psi(X,Y) = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p ( 1-\hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \; .\]

  • \(\hat p_1 = X/n_1\), \(\hat p_2 = Y/n_2\)
  • \(\hat p = \frac{X+Y}{n_1+n_2}\) [Wooclap]
  • If \(np_1, np_2 \gg 1\): \(\psi(X) \sim \mathcal N(0,1)\)
  • We reject if \(|\psi(X,Y)| \geq t_{1-\alpha/2}\) (gaussian quantile)

Example (reference)

  • Question: “should we raise taxes on cigarettes to pay for a healthcare reform ?”
  • \(p_1\), \(p_2\): proportion of non-smokers or smokers willing to raise taxes
  • \(H_0\): \(p_1=p_2\) or \(H_1\): \(p_1 > p_2\)
Non-Smokers Smokers Total
YES 351 41 392
NO 254 195 449
Total 605 154 800
  • \(\hat p_1 = \overline X= \approx 0.58\), \(\hat p_2=\overline Y= \approx 0.21\).
  • \(\psi(X,Y)= \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p ( 1-\hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \approx 8.99\)
  • \(\mathbb P(\psi(X,Y) > 8.99)\) = 1-cdf(Normal(0,1), 8.99)