Hypothesis Testing

Introduction

Tests provide the theoretical basis for decision-making based on data. For example, doctors diagnose diseases using specific biological markers, industrial quality engineers evaluate the quality of a production batch, and climate scientists determine whether there are significant changes in measurements compared to the pre-industrial era.

General Principles

Hypothesis testing is a key statistical method that enables professionals to make informed decisions. The process involves setting up two competing hypotheses:

the null hypothesis $H_0$, which assumes no effect or no difference. This is usually an a priori on the data.
the alternative hypothesis $H_1$, which suggests that there is a significant effect or difference.

Let’s illustrate this with the example of a doctor diagnosing a disease, such as diabetes, using blood glucose levels:

The doctor begins by assuming the null hypothesis $H_0$: the patient’s blood glucose levels are normal, indicating no diabetes.
The alternative hypothesis ($H_1$) would suggest that the patient’s glucose levels are abnormally high, indicating diabetes.

After collecting the patient’s blood glucose data, the doctor compares it to a standard threshold for diagnosing diabetes. A hypothesis test is then performed to evaluate whether the observed glucose level is significantly higher than the threshold, or whether the difference could simply be due to random variation.

The testing process can be summarized as follows:

Fix an objective: test whether Bob has diabetes
Design an experiment: measure of glucose level
Define hypotheses
- Null hypothesis: $H_0$: a priori, Bob has no diabetes
- Alternative hypothesis: $H_1$: Bob has diabete
Define a decision dule: function of the glucose level
Collect data: do the measure of glucose level
Apply the decision rule: reject $H_0$ or not
Draw a conclusion: should Bob follow a treatment or make other tests?

Good and Bad Decisions

Decision	$H_0$ True	$H_1$ True
$T=0$	True Negative (TN)	False Negative (FN) Second Type Error Missed Detection
$T=1$	False Positive (FP) First Type Error False Alarm	True Positive (TP)

In general, observing $T=1$ is just one piece of evidence supporting $H_1$. It’s possible that we were simply unlucky and the data led the test to reject $H_0$. Additionally, the sample used might not be representative, or the elevated glucose level could be caused by another condition, not diabetes. Therefore, we cannot blindly accept $H_1$ based on this result alone. Further investigation is essential when $T=1$ to minimize the risk of making a Type I error.

The same applies when $T=0$. A result of $T=0$ simply indicates that the test provides no evidence in favor of $H_1$, but it doesn’t mean that $H_1$ is false. It is possible that the patient exhibits other symptoms that could still suggest a diagnosis of diabetes, despite the lack of evidence from the test.

Examples

Dice biased toward $6$

Objective: test if Bob is cheating with a dice.
Experiment: Bob rolls the dice $10$ times.
Hypotheses:
- $H_0$: the probability of getting $6$ is $1/6$
- $H_1$: the probability of getting $6$ is larger than $1/6$
Decision rule: the probability of getting a number of $6$ at least equal to the number of observed $6$ is $< 0.05$
Data: the dice falls $10$ times on $6$
Decision: the probability is $1/6^{10} < 0.05$
Conclusion?

In the above example, we formally observe $(X_1, \dots, X_n)$ iid where $X_i \in \{1, \dots, 6\}$ where each $\mathbb P(X_i = k) = p_k$ Imagine that instead of ten $6$, we observed $600$ tosses leading to the following counts:

The number of $6$ is really close to $100 = 600/6$, which implies that we would not reject $H_0$ in the previous example. We would reject however if we wanted to test whether the dice is fair, since it is highly biased toward $2$.

Same data, two conclusions

$H_0: p_6 = 1/6$ VS $H_1: p_6 > 1/6$
Do not reject $H_1$ ($H_0$ is “likely”)

$H_0: (p_1=\dots=p_6 = 1/6)$ (dice is fair) VS $H_1: \exists k: p_k > 1/6$
Reject $H_1$ ($H_0$ is “unlikely”)

Medical test

Objective: test if a fetus has Down syndrome
Experiment: measure the Nucal translucency
Hypotheses:
- $H_0$: nucal translucency is normal $\sim \mathcal N(1.5,0.8)$
- $H_1$: nucal translucency is large
Decision rule: reject if $P_0(X \geq x_{obs}) \leq 0.05$ (“$95^{th}$ percentile”)
Collect data: $x_{obs}=3.02$
Make a decision: Calculate the value of $P_0(X \geq 3.02)=0.029$
Conclusion?

Basics of Hypothesis Testing

Estimation VS Test

We observe some data $X$ in a measurable space $(\mathcal X, \mathcal A)$.
Example $\mathcal X = \mathbb R^n$: $X= (X_1, \dots, X_n)$.

Estimation

One set of distributions $\mathcal P$
Parameterized by $\Theta$ \[ \mathcal P = \{P_{\theta},~ \theta \in \Theta\} \; . \]
$\exists \theta \in \Theta$ such that $X \sim P_{\theta}$

Goal: estimate a given function of $P_{\theta}$, e.g.:

$F(P_{\theta}) = \int x dP_{\theta}$
$F(P_{\theta}) = \int x^2 dP_{\theta}$

Test

Two sets of distributions $\mathcal P_0$, $\mathcal P_1$
Parameterized by disjoints $\Theta_0$, $\Theta_1$ \[ \begin{aligned} \mathcal P_0 = \{P_{\theta} : \theta \in \Theta_1\}, ~~~~ \mathcal P_1 = \{P_{\theta} : \theta \in \Theta_1\}\; . \end{aligned} \]
$\exists \theta \in \Theta_0 \cup \Theta_1$ such that $X \sim P_{\theta}$

Goal: decide between \[H_0: \theta \in \Theta_0 \text{ or } H_1: \theta \in \Theta_1\]

Estimation

Test

Type of Problems

Simple VS Simple:\[\Theta_0 = \{\theta_0\} \text{ and } \Theta_1 = \{\theta_1\}\]
Simple VS Multiple:\[\Theta_0 = \{\theta_0\}\]
Else: Multiple VS Multiple

Simple VS Simple

Multiple VS Multiple

Parametric VS Non-Parametric

Parametric: $\Theta_0$ and $\Theta_1$ included in subspaces of finite dimension.
Non-parametric: otherwise

Example of multiple VS multiple parametric problem:

$H_0: X \sim \mathcal N(\theta,\sigma)$, unknown $\theta < 0$ and unknown $\sigma > 0$: $\Theta_0 \subset \mathbb R^2$
$H_1: X \sim \mathcal N(\theta,\sigma)$, unknown $\theta > 0$ and unknown $\sigma > 0$: $\Theta_1 \subset \mathbb R^2$

Decision Rule and Test Statistic

Decision rule

A decision rule or test $T$ is a measurable function from $\mathcal X$ to $\{0,1\}$: \[ T : \mathcal X \to \{0,1\}\; .\]
It can depend on the sets $\mathcal P_0$ and $\mathcal P_1$
but not on any unknown parameter.

Test statistic

a test statistic $\psi$ is a measurable function from $\mathcal X$ to $\mathbb R$: \[ \psi : \mathcal X \to \mathbb R\; .\]
It can depend on the sets $\mathcal P_0$ and $\mathcal P_1$

Rejection Region

For a given test $T$, the rejection region $\mathcal R \subset \mathbb R$ is the set \[\{\psi(x) \in \mathbb R:~ T(x)=1\} \; .\]
Example of $\mathcal R$, for a given threshold $t>0$, \[ \begin{aligned} T(x) &= \mathbf{1}\{\psi(x) > t\}:~~~~~~\mathcal R = (t,+\infty)\\ T(x) &= \mathbf{1}\{\psi(x) < t\}:~~~~~~\mathcal R = (-\infty,t)\\ T(x) &= \mathbf{1}\{|\psi(x)| > t\}:~~~~~~\mathcal R = (-\infty,t)\cup (t, +\infty)\\ T(x) &= \mathbf{1}\{\psi(x) \not \in [t_1, t_2]\}:~~~~~~\mathcal R = (-\infty,t_1)\cup (t_2, +\infty)\; \end{aligned} \]

Recall of Proba

Consider a probability measure $P$ on $\mathbb R$.

CDF (cumulative distribution function): \[x \to P(~(-\infty,x]~) = \mathbb P(X \leq x) ~~~~\text{(if $X \sim P$ under $\mathbb P$)}\]

Continous Measures

density wrp to Lebesgue: $dP(x) = p(x)dx$
PDF (proba density function): $x \to p(x)$
CDF: $x \to \int_{\infty}^x p(x')dx'$
$\alpha$-quantile $q_{\alpha}$: $\int_{\infty}^{q_{\alpha}} p(x)dx = \alpha$
or $\mathbb P(X \leq q_{\alpha} = \alpha)$

Discrete Measures

density wrp to counting measure: $P(X=x) = p(x)$
CDF: $x \to \sum_{x' \leq x} p(x')dx'$
$\alpha$-quantile $q_{\alpha}$: \[\inf_{q \in \mathbb R}\{q:~\sum_{x_i \leq q}p(x_i) > \alpha\}\]

Examples

Gaussian/Bernoulli

Gaussian $\mathcal N(\mu,\sigma)$: \[p(x) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

Approximation of sum of iid RV (TCL)

Binomial $\mathrm{Bin}(n,q)$: \[p(x)= \binom{n}{x}q^x (1-q)^{n-x}\]

Number of success among $n$ Bernoulli $q$

Exponential/Geometric

Exponential $\mathcal E(\lambda)$: \[p(x) = \lambda e^{-\lambda x}\]

Waiting time for an atomic clock of rate $\lambda$

Geometric $\mathcal{G}(q)$: \[p(x)= q(1-q)^{x-1}\]

Index of first success for iid Bernoulli $q$

Gamma/Poisson

Gamma $\Gamma(k, \lambda)$: \[p(x) = \frac{\lambda^k x^{k-1}e^{-\lambda x}}{(k-1)!}\]

Waiting time for $k$ atomic clocks of rate $\lambda$

Poisson $\mathcal{P}(\lambda)$: \[p(x)=\frac{\lambda^x}{x!}e^{-\lambda}\]

Number of tics before time $1$ of an atomic clock of rate $\lambda$

Simple VS Simple

We observe $X \in \mathcal X=\mathbb R^n$.
$H_0: X \sim P$ or $H_1: X \sim Q$.
We know $P$ and $Q$ but we do not know whether $X \sim P$ or $X \sim Q$

For a given test $T$ we define:

level of $T$ = (type-1 error): $\alpha = P(T(X)=1)$
power of $T$ = 1 - (type-2 error): $\beta = Q(T(X)=1) = 1-Q(T(X)=0)$

Decision	$H_0: X \sim P$	$H_1: X \sim Q$
$T=0$	$1-\alpha$	$1-\beta$
$T=1$	$\alpha$	$\beta$

unbiased: $\beta \geq \alpha$
$\alpha = 0$ for trivial test $T(x)=0$ ! But $\beta =0$ too…

Likelihood Ratio Test

Test $T$ that maximizes $\beta$ at fixed $\alpha$?
Idea: Consider the likelihood ratio test statistic \[\psi(x)=\frac{dQ}{dP}(x) = \frac{q(x)}{p(x)}\]
We consider the likelihood ratio test \[ T^*(x)=\mathbf 1\left\{\frac{q(x)}{p(x)} > t_{\alpha}\right\} \;\]
Equivalently, we can consider the log-likelihood ratio test: \[T^*(x)=\mathbf 1\left\{\log\left(\frac{q(x)}{p(x)}\right) > \log(t_{\alpha})\right\}\]
$t_{\alpha}$ is the $\alpha$-quantile of the distrib $\frac{q(X)}{p(X)}$ if $X\sim P$ \[ \mathbb P_{X \sim P}\left(\frac{q(X)}{p(X)} > t_{\alpha}\right) = \alpha\]

Neyman Pearson

Neyman Pearson’s theorem

The likelihood Ratio Test of level $\alpha$ maximizes the power among all tests of level $\alpha$.

Example: $X \sim \mathcal N(\theta, 1)$.
$H_0: \theta=\theta_0$, $H_1: \theta=\theta_1$.
Check that the log-likelihood ratio is \[ (\theta_1 - \theta_0)x + \frac{\theta_0^2 -\theta_1^2}{2} \; .\]
If $\theta_1 > \theta_0$, an optimal test if of the form \[ T(x) = \mathbf 1\{ x > t \} \; .\]

Proof of Neyman Pearson’s theorem

We prove the theorem in the case where $P$ and $Q$ each have a density $p$ and $q$ on $\mathbb R^n$. For any $t > 0$, define \[I_t(P, Q) = \int_{x \in \mathbb R^n} |q(x) - tp(x)|dx \; .\]

When $t=1$, this quantity is equal to the total-variation distance between $P$ and $Q$. For any event A in $\mathbb R^n$ , it holds that

\[ \begin{split} I_t(P, Q) &= \int_{x \in \mathbb R^n} |q(x) - tp(x)|dx \\ &= \int_{q(x)> tp(x)} q(x) - tp(x)dx + \int_{q(x)< tp(x)} tp(x) - q(x)dx\\ &= 2\int_{x \in \mathbb R^n} \mathbf1_{q(x)> tp(x)}(q(x) - tp(x))dx + t-1 \\ &\geq 2\int_{x \in \mathbb R^n} \mathbf 1_A \mathbf1_{q(x)> tp(x)}(q(x) - tp(x))dx + t-1\\ &\geq t-1 + 2(Q(A) - tP(A)) \; . \end{split} \]

If $A = \{x \in \mathbb R^n : q(x) > tp(x)\}$, then the two last inequalities are equalities. In particular, \[I_t(P, Q) = t-1 + 2\sup_{A \subset \mathbb R^n}(Q(A) - tP(A))) \; .\]

Assume that the type-1 error of $T$ is smaller than $\alpha$: $P(T=1) \leq \alpha$.

The power of $T$, $Q(T = 1)$, is upper-bounded as follows: \[ \begin{split} Q(T=1) &\leq Q(T=1) + t(\alpha - P(T=1))\\ &= \alpha t + Q(T=1) - tP(T=1) \\ &\leq \alpha t + \frac{t-1}{2} + \frac{1}{2}I_t \; . \end{split} \]

There is equality in the second inequality if $T=1$ is the event $\{ \frac{q(X)}{p(X)} > t \}$.
There is equality in the first inequality if $P(T=1) = \alpha$.

Let $t_{\alpha}$ be such that $P(\frac{q(X)}{p(X)}> t_{\alpha}) = \alpha$. The test $T^*(X) = \mathbf1 \{\frac{q(X)}{p(X)}> t_{\alpha}\}$ satisfies the two above points, since its rejection event is exactly $\{T^*(X) = 1\} = \{\frac{q(X)}{p(X)}> t_{\alpha}\}$ and since $P(T^* = 1) = \alpha.$ Hence, for any test $T$ of type 1 error smaller than $\alpha$, it holds that $Q(T = 1) \leq Q(T^* = 1)$.

Examples

Example (Gaussians)

Let $P_{\theta}$ be the distribution $\mathcal N(\theta,1)$.
Observe $n$ iid data $X = (X_1, \dots, X_n)$
$H_0: X \sim P^{\otimes n}_{\theta_0}$ or $H_1: X \sim P^{\otimes n}_{\theta_1}$
Remark: $P^{\otimes n}_{\theta}= \mathcal N((\theta,\dots, \theta), I_n)$
Density of $P^{\otimes n}_{\theta}$:

\[ \begin{aligned} \frac{d P^{\otimes n}_{\theta}}{dx} &= \frac{d P_{\theta}}{dx_1}\dots\frac{d P_{\theta}}{dx_n} \\ &= \frac{1}{\sqrt{2\pi}^n}\exp\left({-\sum_{i=1}^n\frac{(x_i - \theta)^2}{2}}\right) \\ &= \frac{1}{\sqrt{2\pi}^n}\exp\left(-\frac{\|x\|^2}{2} + n\theta \overline x - \frac{\theta^2}{2}\right)\; . \end{aligned} \]

Log-likelihood ratio test:
$T(x) = \mathbf 1\{\overline x > t_{\alpha}\}$ if $\theta_1 > \theta_0$
$T(x) = \mathbf 1\{\overline x < t_{\alpha}\}$ otherwise

Example: Exponential Families

A set of distributions $\{P_{\theta}\}$ is an exponential family if each density $p_{\theta}(x)$ is of the form \[ p_{\theta}(x) = a(\theta)b(x) \exp(c(\theta)d(x)) \; , \]
We observe $X = (X_1, \dots, X_n)$. Consider the following testing problem: \[H_0: X \sim P_{\theta_0}^{\otimes n}~~~~ \text{or}~~~~ H_1:X \sim P_{\theta_1}^{\otimes n} \; .\]
Likelihood ratio: \[ \frac{dP^{\otimes n}_{\theta_1}}{dP^{\otimes n}_{\theta_0}} = \left(\frac{a(\theta_1)}{a(\theta_0)}\right)^n\exp\left((c(\theta_1)-c(\theta_0))\sum_{i=1}^n d(x_i)\right) \; . \]

\[ T(X) = \mathbf 1\left\{\frac{1}{n}\sum_{i=1}^n d(X_i) > t\right\} \;. ~~~~\text{(calibrate $t$)}\]

Example: Radioactive Source

The number of particle emitted in $1$ unit of time is follows distribution $P \sim \mathcal P(\lambda)$.
We observe $20$ time units, that is $N \sim \mathcal P(20\lambda)$.
Type A sources emit an average of $\lambda_0 = 0.6$ particles/time unit
Type B sources emit an average of $\lambda_1 = 0.8$ particles/time unit
$H_0$: $N \sim \mathcal P(20\lambda_0)$ or $H_1$: $N\sim \mathcal P(20\lambda_1)$
Likelihood Ratio Test: \[T(X)=\mathbf 1\left\{\sum_{i=1}^{20}X_i > t_{\alpha}\right\} \; .\]
$t_{0.95}$: quantile(Poisson(20*0.6), 0.95) gives $18$
$\mathbb P(\mathcal P(20*0.6) \leq 17)$: 1-cdf(Poisson(20*0.6), 17) gives $0.063$
$\mathbb P(\mathcal P(20*0.6) \leq 18)$: 1-cdf(Poisson(20*0.6), 18) gives $0.038$: reject if $N \geq 19$

Multiple-Multiple Tests

Generalities

$H_0 = \{ \mathcal P_{\theta}, \theta \in \Theta_0 \}$ is not a singleton
No meaning of $\mathbb P_{H_0}(X \in A)$
level of $T$: \[ \alpha = \sup_{\theta \in \Theta_0}P_{\theta}(T(X)=1) \; . \]
power function $\beta: \Theta_1 \to [0,1]$ \[ \beta(\theta) = P_{\theta}(T(X)=1) \]
$T$ is unbiased if $\beta (\theta) \geq \alpha$ for all $\theta \in \Theta_1$.
If $T_1$, $T_2$ are two tests of level $\alpha_1$, $\alpha_2$
$T_2$ is uniformly more powerfull ($UMP$) than $T_1$ if
1. $\alpha_2 \leq \alpha_1$
2. $\beta_2(\theta) \geq \beta_1(\theta)$ for all $\theta \in \Theta_1$
$T^*$ is $UMP_{\alpha}$ if it is $UMP$ than any other test $T$ of level $\alpha$.

Multiple-Multiple Tests in $\mathbb R$

Assumption: $\Theta_0 \cup \Theta_1 \subset \mathbb R$.

One-tailed tests: \[ \begin{aligned} H_0: \theta \leq \theta_0 ~~~~ &\text{ or } ~~~ H_1: \theta > \theta_0 ~~~ \text{(right-tailed: unilatéral droit)}\\ H_0: \theta \geq \theta_0 ~~~ &\text{ or } ~~~ H_1: \theta < \theta_0 ~~~ \text{(left-tailed: unilatéral gauche)} \end{aligned} \]
Two-tailed tests: \[ \begin{aligned} H_0: \theta = \theta_0 ~~~ &\text{ or } ~~~ H_1: \theta \neq \theta_0 ~~~ \text{(simple/multiple)}\\ H_0: \theta \in [\theta_1, \theta_2] ~~~ &\text{ or } ~~~ H_1: \theta \not \in [\theta_1, \theta_2] ~~~ \text{( multiple/multiple)} \end{aligned} \]

Theorem

Assume that $p_{\theta}(x) = a(\theta)b(x)\exp(c(\theta)d(x))$ and that $c$ is a non-decreasing (croissante) function, and consider a one-tailed test problem.
There exists a UMP$\alpha$ test. It is $\mathbf 1\{\sum d(X_i) > t \}$ if $H_1: \theta > \theta_0$ (right-tailed test).

Pivotal Test Statistic and P-value

Pivotal Test Statistic

Consider $\Theta_0$ not singleton. $\mathbb P_{H_0}(X \in A)$ has no meaning.

Pivotal test statistic

$\psi: \mathcal X \to \mathbb R$ is pivotal if the distribution of $\psi(X)$ under $H_0$ does not depend on $\theta \in \Theta_0$:

for any $\theta, \theta' \in \Theta_0$, and any event $A$, \[ \mathbb P_{\theta}(\psi(X) \in A) = \mathbb P_{\theta'}(\psi(X) \in A) \; .\]

Example: If $X=(X_1, \dots, X_n)$ are iid $\mathcal N(0, \sigma)$, the distrib of \[ \psi(X) = \frac{\sum_{i=1}^n \overline X}{\sqrt{\sum_{i=1}^n X_i^2}}\] does not depend on $\sigma$.

P-value

See the (Pluto notebook: Illustration of pvalue)

P-value: definition

We define $p_{value}(x_{\mathrm{obs}}) =\mathbb P(\psi(X) > x_{\mathrm{obs}})$ for a right-tailed test.

For a two-tailed test, $p_{value}(x_{\mathrm{obs}}) =2\min(\mathbb P(\psi(X) > x_{\mathrm{obs}}),\mathbb P(\psi(X) < x_{\mathrm{obs}}))$

P-value under $H_0$

Under $H_0$, for a pivotal test statistic $\psi$, $p_{value}(X)$ has a uniform distribution $\mathcal U([0,1])$.

Proof.

The p-value is a probability, so it belongs to $[0,1]$. Let $F_{\psi}$ be the cumulative distribution function of a random variable $\psi(X)$ when $X$ follows distribution $P$, that is $G_{\psi}(t) = \mathbb P(\psi(X) > t)$. It holds that \[ \mathbb P(\psi(X') > \psi(X) ~|~ \psi(X)) = F_{\psi}(\psi(X)) \] If $u \in [0,1]$, then \[ \begin{aligned} \mathbb P(\mathrm{pvalue}(X) > u) &= \mathbb P(F_{\psi}(\psi(X)) > u) \\ &= \mathbb P(\psi(X) > F_{\psi}^{-1}(u)) \\ &= 1- F_{\psi}(F_{\psi}^{-1}(u)) = 1-u \end{aligned} \] Hence, $\mathrm{pvalue}(X)$ is uniform when $X$ follows distribution $\mathbb P$. \[\tag*{$\blacksquare$}\]

In practice: reject if $p_{value}(X) < \alpha = 0.05$
$\alpha$ is the level or type-1-error of the test

Illustration if $\psi(X) \sim \mathcal N(0,1)$ under $H_0$:

Decision	\(H_0: X \sim P\)	\(H_1: X \sim Q\)
\(T=0\)	\(1-\alpha\)	\(1-\beta\)
\(T=1\)	\(\alpha\)	\(\beta\)

Hypothesis Testing

Introduction

General Principles

Good and Bad Decisions

Examples

Dice biased toward \(6\)

Medical test

Basics of Hypothesis Testing

Estimation VS Test

Estimation

Test

Estimation

Test

Type of Problems

Simple VS Simple

Multiple VS Multiple

Parametric VS Non-Parametric

Decision Rule and Test Statistic

Rejection Region

Recall of Proba

Continous Measures

Discrete Measures

Examples

Gaussian/Bernoulli

Exponential/Geometric

Gamma/Poisson

Simple VS Simple

Likelihood Ratio Test

Neyman Pearson

Proof of Neyman Pearson’s theorem

Examples

Example (Gaussians)

Example: Exponential Families

Example: Radioactive Source

Multiple-Multiple Tests

Generalities

Multiple-Multiple Tests in \(\mathbb R\)

Pivotal Test Statistic and P-value

Pivotal Test Statistic

P-value