TD1: Introduction to statistical models

Distributions to know (reflexes)

| Distribution | Support | Mean | Variance | Density / PMF |
|---|---|---|---|---|
| \(\mathcal{N}(\mu, \sigma^2)\) | \(\mathbb{R}\) | \(\mu\) | \(\sigma^2\) | \(\dfrac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\) |
| \(\mathrm{Bin}(n, p)\) | \(\{0,\ldots,n\}\) | \(np\) | \(np(1-p)\) | \(\dbinom{n}{k}p^k(1-p)^{n-k}\) |
| \(\mathcal{G}(p)\) | \(\mathbb{N}^*\) | \(1/p\) | \((1-p)/p^2\) | \((1-p)^{k-1}p\) |
| \(\mathcal{E}(\lambda)\) | \(\mathbb{R}_+\) | \(1/\lambda\) | \(1/\lambda^2\) | \(\lambda e^{-\lambda x}\) |
| \(\mathcal{P}(\lambda)\) | \(\mathbb{N}\) | \(\lambda\) | \(\lambda\) | \(e^{-\lambda}\dfrac{\lambda^k}{k!}\) |
| \(\Gamma(k,\lambda)\) | \(\mathbb{R}_+\) | \(k/\lambda\) | \(k/\lambda^2\) | \(\dfrac{\lambda^k x^{k-1} e^{-\lambda x}}{\Gamma(k)}\) |

What to remember:

  • \(\mathrm{Bin}(n, p)\): sum of \(n\) independent Bernoulli\((p)\) — counts the number of successes.
  • \(\mathcal{G}(p)\): number of trials until the first success (geometric waiting time).
  • \(\mathcal{E}(\lambda)\): duration of an exponential clock ticking at rate \(\lambda\) — memoryless.
  • \(\mathcal{P}(\lambda)\): number of ticks of a rate-\(\lambda\) Poisson clock on \([0, 1]\).
  • \(\Gamma(k, \lambda)\): sum of \(k\) independent \(\mathcal{E}(\lambda)\) variables (for integer \(k\); the distribution extends to all \(k > 0\)). You don’t have to learn its density.
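The last bullet is easy to check numerically. A minimal simulation sketch (standard library only; the values \(k = 3\), \(\lambda = 2\) are illustrative, not from the exercise sheet), summing \(k\) independent exponentials and comparing the empirical moments with \(k/\lambda\) and \(k/\lambda^2\):

```python
import random

random.seed(0)

# Illustrative check: a Gamma(k, lam) draw as the sum of k
# independent Exp(lam) draws, valid for integer k.
def gamma_sum_of_exponentials(k, lam):
    return sum(random.expovariate(lam) for _ in range(k))

k, lam, n = 3, 2.0, 100_000          # illustrative parameters
samples = [gamma_sum_of_exponentials(k, lam) for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
print(f"empirical mean ≈ {mean:.3f} (theory {k / lam})")
print(f"empirical variance ≈ {var:.3f} (theory {k / lam**2})")
```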

Exercise 0: \(2\sigma\)-Game

For each distribution \(P\), compute the \(2\sigma\) interval \([\mu - 2\sigma,\, \mu + 2\sigma]\) and indicate whether \(x_\mathrm{obs}\) falls inside or outside.

  • \(P = \mathcal{N}(2,\; 1.5^2)\), \(x_\mathrm{obs} = 5\)
  • \(P = \mathcal{N}(-10,\; 50^2)\), \(x_\mathrm{obs} = 90\)
  • \(P = \mathcal{P}(12)\), \(x_\mathrm{obs} = 20\)
  • \(P = \mathrm{Bin}(100,\; 0.4)\), \(x_\mathrm{obs} = 55\)
  • \(P = \mathcal{E}(1)\), \(x_\mathrm{obs} = 3\)
  • \(P = \mathcal{G}(0.25)\), \(x_\mathrm{obs} = 10\)
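To check your answers, here is a minimal sketch (standard library only; the means and standard deviations are hard-coded from the table above) that prints each \(2\sigma\) interval and whether \(x_\mathrm{obs}\) lands inside:

```python
import math

# (name, mean, std, x_obs) -- moments taken from the distribution table.
cases = [
    ("N(2, 1.5^2)",   2.0,   1.5,                        5),
    ("N(-10, 50^2)", -10.0, 50.0,                       90),
    ("P(12)",         12.0, math.sqrt(12),              20),
    ("Bin(100, 0.4)", 40.0, math.sqrt(100 * 0.4 * 0.6), 55),
    ("E(1)",           1.0,  1.0,                        3),
    ("G(0.25)",        4.0, math.sqrt(0.75 / 0.25**2),  10),
]

def two_sigma(mu, sigma):
    """Endpoints of the interval [mu - 2*sigma, mu + 2*sigma]."""
    return mu - 2 * sigma, mu + 2 * sigma

for name, mu, sigma, x in cases:
    lo, hi = two_sigma(mu, sigma)
    print(f"{name}: [{lo:.2f}, {hi:.2f}]  x_obs={x}  inside={lo <= x <= hi}")
```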

Exercise 1

We aim to determine whether ENSAI students have any preference for cats or dogs. We assume that, a priori, they have no preference on average. We ask \(n\) students what their preferences are, and we let \(X\) be the number of “cat” answers.

  1. Define \(H_0\) and \(H_1\). Is it a one-sided or two-sided test?

  2. We observe \(n = 10\) students and \(X = 8\) “cat” answers. Compute the p-value in this specific case. Interpret the result.

  3. Write the expression of the p-value in terms of \(n\), \(X\), and \(F\), the CDF of \(\mathrm{Bin}(n, 0.5)\). Hint: recall that for a discrete distribution, the upper-tail probability is \(P(X \geq k) = 1 - F(k-1)\).

  4. Write a line of code to compute the p-value in Julia, Python, or R.

  5. What is the p-value if \(H_1\) is instead:

    1. “Students prefer cats”, or
    2. “Students prefer dogs”?
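For self-checking, one possible answer to question 4 (here in the two-sided setting of question 1) is a single call to a binomial CDF, e.g. `2 * (1 - pbinom(7, 10, 0.5))` in R. A dependency-free Python sketch of the same computation, using the hint \(P(X \geq k) = 1 - F(k-1)\):

```python
import math

def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Bin(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k, n + 1))

n, x = 10, 8
# Bin(n, 1/2) is symmetric, so the two-sided p-value doubles the upper tail.
p_two_sided = 2 * binom_sf(x, n, 0.5)
print(p_two_sided)  # 2 * 56/1024 = 0.109375
```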

Exercise 2

Let \((X_1, X_2, \ldots, X_n)\) be i.i.d. random variables following distribution \(\mathcal{E}(\lambda)\). We want to test: \[ H_0: \lambda = \tfrac{1}{2} \quad \text{vs.} \quad H_1: \lambda = 1. \]

We recall that the density of \(\Gamma(k, \lambda)\) (shape \(k\), rate \(\lambda\)) is: \[ p(x) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{(k-1)!}, \quad x > 0, \quad k \in \mathbb{N}^*. \]

  1. Show that if \(X \sim \mathcal{E}(\lambda)\) and \(Y \sim \Gamma(k, \lambda)\) are independent, with \(k \in \mathbb{N}^*\), then \(X + Y \sim \Gamma(k+1, \lambda)\).

  2. Deduce that \(S_n = \sum_{i=1}^n X_i\) follows a Gamma distribution \(\Gamma(n, \lambda)\).

  3. For a sample of size \(n = 10\), what is the rejection region of \(S_n\) for the simple likelihood ratio test at the \(0.05\) significance level?
    You may use without proof that \(\Gamma(n, \tfrac{1}{2}) \overset{d}{=} \chi^2(2n)\).
    Bonus: Show this fact for \(n = 1\), using the result that if \(Z \sim \mathcal{N}(0,1)\) then \(Z^2 \sim \chi^2(1)\), and a suitable change of variable.

  4. The empirical mean is \(\bar{x}_{10} = 2.5\). What can we conclude?

  5. Recall what a CDF is. The p-value is a tail probability \(P_{H_0}(S_{10} \geq s_{\mathrm{obs}})\); explain how to read it as \(1 - F(s_{\mathrm{obs}})\) from the CDF of the \(\chi^2(20)\) distribution shown below.

  6. Compare the p-value obtained when using a Gaussian approximation of \(\sum X_i\) via the CLT. We recall that \(\mathbb{V}(X_1) = \frac{1}{\lambda^2}\).
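To check questions 5 and 6 numerically, the \(\chi^2(2n)\) tail can be computed without any library via the Gamma–Poisson identity \(P(\chi^2(2n) \geq x) = P(\mathrm{Poisson}(x/2) \leq n-1)\). A sketch (the tail direction follows the statement of question 5):

```python
import math

def chi2_sf_even_df(x, df):
    """P(X >= x) for X ~ chi^2(df) with df even, via the
    Gamma-Poisson identity: P(chi2(2n) >= x) = P(Poisson(x/2) <= n-1)."""
    n = df // 2
    lam = x / 2
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(n))

n, lam0 = 10, 0.5
s_obs = n * 2.5                      # S_10 = n * xbar_10 = 25
p_exact = chi2_sf_even_df(s_obs, 2 * n)

# Question 6: under H0, E[S_n] = n/lam0 and V(S_n) = n/lam0^2, so the CLT
# gives P(S_n >= s_obs) ~ P(Z >= z) with z standardized as below.
z = (s_obs - n / lam0) / math.sqrt(n / lam0**2)
p_clt = 0.5 * math.erfc(z / math.sqrt(2))

print(f"exact p-value (chi2(20) tail): {p_exact:.4f}")
print(f"CLT approximation:             {p_clt:.4f}")
```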

Exercise 3

Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables drawn from \(\mathcal{N}(\theta, 1)\). To test \(H_0: \theta = 5\) against \(H_1: \theta > 5\), we consider the test: \[ T = \mathbf{1}\bigl\{\bar{X} > 5 + u\bigr\}, \] where \(\bar{X}\) is the empirical mean and \(u > 0\) is to be fixed.

  1. Let \(g(t) = P(Z \geq t) - e^{-t^2/2}\) for \(Z \sim \mathcal{N}(0,1)\) and \(t \geq 0\). Compute \(g'(t)\) and study its sign.

  2. Using the behaviour of \(g\) at \(+\infty\) and the sign of \(g'\), deduce that \(g(t) \leq 0\) for all \(t \geq 0\), i.e.: \[ P(Z \geq t) \leq e^{-t^2/2}. \] Hint: treat separately the intervals \([0, \frac{1}{\sqrt{2\pi}}]\) and \([\frac{1}{\sqrt{2\pi}}, +\infty)\).

  3. Deduce a value of \(u\) such that the type I error of the test \(T\) is at most \(\alpha\). Rewrite the test \(T\) as a function of \(\alpha\) and \(n\).

  4. Fix \(\alpha = 1/e\) (so that \(u = \sqrt{2/n}\)). Compute the power function \(\beta(\theta) = P_\theta(T = 1)\) for \(\theta > 5\).
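The Gaussian tail bound \(P(Z \geq t) \leq e^{-t^2/2}\) can be sanity-checked numerically, since \(P(Z \geq t) = \tfrac{1}{2}\,\mathrm{erfc}(t/\sqrt{2})\). A sketch with a few illustrative values of \(t\):

```python
import math

def norm_sf(t):
    """P(Z >= t) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2))

# The bound should hold for every t >= 0 (these t values are illustrative).
for t in [0.0, 0.5, 1.0, 2.0, 3.0, 5.0]:
    assert norm_sf(t) <= math.exp(-t**2 / 2)
    print(f"t={t}: P(Z>=t)={norm_sf(t):.3e}  bound={math.exp(-t**2 / 2):.3e}")
```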

Exercise 4

Consider the family of Pareto distributions with known parameter \(a > 0\) and unknown parameter \(\theta > 0\): \[ f(x;\theta) = \begin{cases} \dfrac{\theta}{a}\!\left(\dfrac{a}{x}\right)^{\!\theta+1}, & x \geq a, \\[6pt] 0, & x < a. \end{cases} \]

  1. Compute the mean and variance of \(X \sim \mathrm{Pareto}(a, \theta)\). You may assume \(\theta > 2\) for the variance to be finite.

  2. Rewrite the density in the exponential family form \(f(x;\theta) = a(x)\,b(\theta)\,\exp\!\bigl(c(\theta)\,d(x)\bigr)\) and identify \(a\), \(b\), \(c\), and \(d\).

  3. Deduce the general form of the uniformly most powerful test \(\mathrm{UMP}_\alpha\) for \[H_0: \theta \geq \theta_0 \quad \text{vs.} \quad H_1: \theta < \theta_0.\] Indicate carefully the direction of the rejection region for \(\sum d(X_i)\).

  4. For \(a = 1\), construct the UMP test for the null hypothesis: “the mean of the distribution is smaller than or equal to \(2\).” Start by expressing this constraint in terms of \(\theta\).

  5. What is the distribution of \(d(X_1)\)? (\(d\) is defined in Q.2.)
    Deduce the distribution of \(\sum_{i=1}^n d(X_i)\) under \(H_0\), and identify a pivot that follows a \(\chi^2\) distribution.

  6. For \(a = 1\), \(n = 20\), and \(\alpha = 0.05\), write a line of code in Julia, Python, or R to compute the rejection threshold (i.e., the critical value of \(\sum_{i=1}^n \log X_i\)).
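One possible answer to question 6 is a single quantile call, e.g. `scipy.stats.chi2.ppf(0.95, 40) / 4` in Python (the divisor \(2\theta_0 = 4\) assumes \(\theta_0 = 2\) from question 4, via the pivot \(2\theta_0 \sum \log X_i \sim \chi^2(2n)\)). A dependency-free sketch that recovers the same threshold by bisection on the even-degree \(\chi^2\) CDF:

```python
import math

def chi2_sf_even_df(x, df):
    """P(X >= x) for X ~ chi^2(df), df even (Gamma-Poisson identity)."""
    n = df // 2
    lam = x / 2
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(n))

def chi2_quantile_even_df(q, df):
    """Solve P(X <= x) = q by bisection (df even)."""
    lo, hi = 0.0, 10.0 * df
    for _ in range(200):
        mid = (lo + hi) / 2
        if 1 - chi2_sf_even_df(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, alpha, theta0 = 20, 0.05, 2.0     # theta0 = 2 taken from question 4
# At the boundary theta = theta0, 2*theta0*sum(log X_i) ~ chi^2(2n), and the
# test rejects for large values of sum(log X_i).
threshold = chi2_quantile_even_df(1 - alpha, 2 * n) / (2 * theta0)
print(threshold)
```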