TD4: Homogeneity/Dependency
Exercise 1
A sociologist wants to investigate whether the choice of transportation mode (Car, Bicycle, or Public Transit) varies among residents of three different cities: City A, City B, and City C.
The sociologist conducted a survey, and the responses are summarized in the contingency table below:
Transportation Mode | City A | City B | City C | Total |
---|---|---|---|---|
Car | 120 | 150 | 100 | 370 |
Bicycle | 80 | 60 | 90 | 230 |
Public Transit | 100 | 90 | 110 | 300 |
Total | 300 | 300 | 300 | 900 |
- Formulate the hypothesis testing problem corresponding to the sociologist's initial objective. Introduce the notation.
- Answer the initial question using a chi-squared test at the \(0.05\) significance level.
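As a numerical check for the second question, the chi-squared statistic can be computed directly from the table. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and the closed-form p-value is a shortcut that holds only because the test has 4 degrees of freedom here.

```python
import math

# Observed counts from the table (rows: Car, Bicycle, Public Transit;
# columns: City A, City B, City C).
observed = [
    [120, 150, 100],
    [80, 60, 90],
    [100, 90, 110],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts under homogeneity: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# With 4 degrees of freedom, the chi-squared survival function is
# exp(-x/2) * (1 + x/2). This shortcut is NOT valid for other dof values.
assert dof == 4
p_value = math.exp(-chi2_stat / 2) * (1 + chi2_stat / 2)

print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p-value = {p_value:.4f}")
```

The null hypothesis of homogeneity is rejected at level \(0.05\) when `p_value` is below \(0.05\), or equivalently when `chi2_stat` exceeds the \(0.95\) quantile of the chi-squared distribution with 4 degrees of freedom read from a table.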
Exercise 2
The statistician of an insurance company is tasked with studying the impact of an advertising campaign conducted in 7 regions where the company operates. To do this, he has extracted from the database the number of new clients acquired by a certain number of agents in each region.
Region | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Number of agents | 9 | 7 | 7 | 6 | 7 | 6 | 6 |
Average number of new clients | 26.88 | 22.34 | 19.54 | 18.95 | 27.17 | 25.87 | 25.72 |
Variance of new clients | 13.54 | 12.59 | 12.87 | 13.42 | 13.17 | 12.56 | 12.64 |
The statistician decides to perform an analysis of variance to test whether the regional factor influences the number of new clients. Let \(X_{ik}\) denote the number of new clients of agent \(i\) in region \(k\), \(N_k\) the number of agents in region \(k\), \(d = 7\) the number of regions and \(N_{\mathrm{tot}} = 48\) the total number of agents. Assume that the random variables \(X_{ik}\) are independent and normally distributed with mean \(\mu_k\) and variance \(\sigma^2\). Define:
\[ \begin{array}{rl} \overline X_k &= \frac{1}{N_k} \sum_{i=1}^{N_k} X_{ik}\\ \overline{X} &= \frac{1}{N_{\mathrm{tot}}} \sum_{k=1}^d N_k\overline X_{k}\end{array} \qquad \begin{array}{rl} V_k &= \frac{1}{N_k}\sum_{i=1}^{N_k} (X_{ik} - \overline X_k)^2 \\ V_W &= \frac{1}{N_{\mathrm{tot}}} \sum_{k=1}^d N_kV_k\\ V_B &= \frac{1}{N_{\mathrm{tot}}}\sum_{k=1}^{d} N_k(\overline X_k - \overline X)^2\\ V_{T} &= \frac{1}{N_{\mathrm{tot}}}\sum_{k=1}^d\sum_{i=1}^{N_k} (X_{ik} - \overline X)^2 \end{array} \]
- Formulate the hypothesis testing problem to test whether the number of new clients is homogeneous across the regions.
- What do \(\overline X_k\), \(\overline X\), \(V_k\), \(V_W\), \(V_B\), \(V_T\) represent?
- Prove the analysis of variance formula: \[V_T = V_W + V_B \; .\] Hint: subtract and add \(\overline X_k\) inside the square in the definition of \(V_T\).
- Compute \(\overline X\), \(V_W\), \(V_B\), and \(V_T\).
- Write the definition of the ANOVA test statistic in terms of \(V_W\) and \(V_B\).
- Did the advertising campaign have the same impact in all regions?
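The numerical questions above can be checked from the summary table alone. The sketch below uses only the Python standard library and assumes that the tabulated variances are the \(V_k\) defined above (divisor \(N_k\)); it also uses the usual one-way ANOVA normalization by the degrees of freedom \(d-1\) and \(N_{\mathrm{tot}}-d\), which is one standard form of the statistic and may differ from the convention of the course by a constant factor.

```python
# Summary statistics from the table, one entry per region.
n_agents = [9, 7, 7, 6, 7, 6, 6]
means = [26.88, 22.34, 19.54, 18.95, 27.17, 25.87, 25.72]
# Assumption: these are the V_k of the exercise (divisor N_k, not N_k - 1).
variances = [13.54, 12.59, 12.87, 13.42, 13.17, 12.56, 12.64]

d = len(n_agents)
n_tot = sum(n_agents)

# Overall mean: weighted average of the regional means.
x_bar = sum(n * m for n, m in zip(n_agents, means)) / n_tot

# Within-group and between-group variances.
v_w = sum(n * v for n, v in zip(n_agents, variances)) / n_tot
v_b = sum(n * (m - x_bar) ** 2 for n, m in zip(n_agents, means)) / n_tot

# Total variance obtained from the analysis of variance formula.
v_t = v_w + v_b

# F statistic with (d - 1, n_tot - d) degrees of freedom under H0.
f_stat = (v_b / (d - 1)) / (v_w / (n_tot - d))

print(f"x_bar = {x_bar:.3f}, V_W = {v_w:.3f}, V_B = {v_b:.3f}, V_T = {v_t:.3f}")
print(f"F = {f_stat:.3f} with ({d - 1}, {n_tot - d}) degrees of freedom")
```

The value of `f_stat` is then compared with the appropriate quantile of the Fisher distribution with \((6, 41)\) degrees of freedom read from a table.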
Exercise 3
Data are collected from 7 students. We want to analyze the correlation between the number of hours a student spends studying before an exam and the student's test score.
Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Study Hours | 2.5 | 3.0 | 1.5 | 4.0 | 3.5 | 5.0 | 3.0 |
Test Score | 56 | 64 | 45 | 72 | 68 | 80 | 59 |
- Formulate the hypothesis testing problem for a linear correlation test
- Perform the linear correlation test at level \(0.05\)
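The computation behind the correlation test can be sketched as follows, using only the Python standard library. The sketch assumes the usual test of \(H_0: \rho = 0\) based on the Pearson coefficient \(r\) and the statistic \(t = r\sqrt{n-2}/\sqrt{1-r^2}\), which follows a Student distribution with \(n-2\) degrees of freedom under \(H_0\) when the pairs are Gaussian; the variable names are illustrative.

```python
import math

hours = [2.5, 3.0, 1.5, 4.0, 3.5, 5.0, 3.0]
scores = [56, 64, 45, 72, 68, 80, 59]
n = len(hours)

mean_h = sum(hours) / n
mean_s = sum(scores) / n

# Centered sums of squares and cross products.
s_hh = sum((h - mean_h) ** 2 for h in hours)
s_ss = sum((s - mean_s) ** 2 for s in scores)
s_hs = sum((h - mean_h) * (s - mean_s) for h, s in zip(hours, scores))

# Pearson correlation coefficient.
r = s_hs / math.sqrt(s_hh * s_ss)

# Test statistic, Student with n - 2 degrees of freedom under H0: rho = 0.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"r = {r:.4f}, t = {t_stat:.2f}, degrees of freedom = {n - 2}")
```

For a two-sided test at level \(0.05\), \(|t|\) is compared with the \(0.975\) quantile of the Student distribution with 5 degrees of freedom, which is about \(2.571\) in standard tables.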
Exercise 4
Below are stress scores for \(10\) patients before and after a sport session:
Participant | Stress Score (Before) | Stress Score (After) | Difference | Rank/Sign |
---|---|---|---|---|
1 | 40 | 32 | ||
2 | 38 | 35 | ||
3 | 45 | 40 | ||
4 | 50 | 42.5 | ||
5 | 44 | 41.5 | ||
6 | 48 | 48 | ||
7 | 39 | 30 | ||
8 | 42 | 38 | ||
9 | 47 | 46 | ||
10 | 46.5 | 40 | ||
We want to test whether the sport session has an effect on the patients' stress.
- Formulate the hypothesis testing problem
- Complete the above table
- Perform a Wilcoxon signed rank test
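The signed-rank computation can be checked with the short sketch below, written with the Python standard library only. It assumes that zero differences are discarded (a common convention, to be adapted if the course uses another one), gives tied absolute differences their average rank, and obtains an exact two-sided p-value by enumerating all sign assignments, which is feasible here because the sample is small. The helper `average_ranks` is a hypothetical name introduced for this illustration.

```python
from itertools import product

before = [40, 38, 45, 50, 44, 48, 39, 42, 47, 46.5]
after = [32, 35, 40, 42.5, 41.5, 48, 30, 38, 46, 40]

# Assumption: pairs with a zero difference are dropped before ranking.
diffs = [b - a for b, a in zip(before, after) if b != a]
n = len(diffs)


def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        avg = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks


ranks = average_ranks([abs(x) for x in diffs])

# Sums of the ranks of positive and of negative differences.
w_plus = sum(r for r, x in zip(ranks, diffs) if x > 0)
w_minus = sum(r for r, x in zip(ranks, diffs) if x < 0)

# Exact null distribution of W+: each rank is kept with probability 1/2.
null_values = [
    sum(r for r, keep in zip(ranks, signs) if keep)
    for signs in product([0, 1], repeat=n)
]
p_upper = sum(w >= w_plus for w in null_values) / len(null_values)
p_lower = sum(w <= w_plus for w in null_values) / len(null_values)
p_value = min(1.0, 2 * min(p_upper, p_lower))

print(f"n = {n}, W+ = {w_plus}, W- = {w_minus}, p-value = {p_value:.4f}")
```

The list `diffs` and the list `ranks` correspond to the two columns to be completed in the table, restricted to the participants with a nonzero difference.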
Exercise 5
Let \(X=(X_1, \dots, X_N)\) be a Gaussian vector with distribution \(\mathcal N(0, I_N)\) in \(\mathbb R^N\) (i.e. the \(X_i\) are i.i.d. \(\mathcal N(0,1)\)).
- What is the distribution of \(QX\) if \(Q\) is an orthogonal matrix (\(QQ^T = I_N\))?
- What is the distribution of \(\|PX\|^2\) if \(P\) is an orthogonal projector?
Hint: use the rank of \(P\), defined as \(rk(P) = dim(Im(P))\). Recall that an orthogonal projector is a matrix \(P\) such that \(P^2=P\) and \(P = P^T\).
- Show that if \(P\) is an orthogonal projector, then \(PX\) is independent of \((I-P)X\).
Hint: use the fact that two centered, jointly Gaussian vectors \(U\), \(V\) are independent iff \(\mathbb E[U_iV_j] = 0\) for all \(i,j\). Translate this fact into matrix form.
- What is the distribution of \(\frac{N-rk(P)}{rk(P)}\frac{\|PX\|^2}{\|(I-P)X\|^2}\)?
- Show that if \(P\), \(P_0\) are two orthogonal projectors such that \(Im(P_0) \subset Im(P)\), then \(P(I-P_0)X\) is independent of \((I-P)(I-P_0)X\). What is the distribution of \(\|P(I-P_0)X\|^2\)?
Hint: show first that \(PP_0=P_0P= P_0\), and that \(P-P_0\) is an orthogonal projector.
- What is the orthogonal projector \(P_0\) on \(\mathrm{Span}((1, \dots, 1))\)? Deduce that \((X_i - \overline X)\) is independent of \(\overline X\) for all \(i\).
- We divide \(N\) into \(d\) blocks: \(N = N_1 + \dots + N_d\). We write \((X_1, \dots, X_N) = ((X_{11}, \dots, X_{N_1 1}), (X_{12}, \dots, X_{N_2 2}), \dots, (X_{1d}, \dots, X_{N_d d}))\).
- What is the orthogonal projection on the line spanned by \(E_k =((0, \dots, 0), \dots, (0, \dots, 0),(1, \dots, 1), (0,\dots, 0) ,\dots, (0, \dots, 0))\)? (\(E_k\) has zeros everywhere except on block \(k\), where all entries are ones.)
- Give the orthogonal projection \(P\) on \(\mathrm{Span}(E_1, \dots, E_d)\). Write \((I-P)(I-P_0)X\) and \(P(I-P_0)X\) explicitly.
- Deduce the distribution of \(\frac{d-1}{N-d}\frac{\|(I-P)(I-P_0)X\|^2}{\|P(I-P_0)X\|^2}\) and of the ANOVA test statistic under \(H_0\).
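The algebraic identities used in this exercise can be sanity-checked on a small example before proving them in general. The sketch below builds \(P_0\) and \(P\) with exact rational arithmetic for a hypothetical block structure \(N = 2 + 3 + 1\) (an arbitrary choice made for this illustration) and verifies the projector identities and the ranks \(d-1\) and \(N-d\) through traces. It is a check on one instance, not a proof; the helper functions are written for this example only.

```python
from fractions import Fraction


def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def subtract(a, b):
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def transpose(a):
    return [list(col) for col in zip(*a)]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


# Hypothetical block sizes N_1, ..., N_d chosen for the illustration.
block_sizes = [2, 3, 1]
n_total = sum(block_sizes)
d = len(block_sizes)

identity = [[Fraction(int(i == j)) for j in range(n_total)]
            for i in range(n_total)]

# P0: projector on Span((1, ..., 1)), every entry equals 1/N.
p0 = [[Fraction(1, n_total) for _ in range(n_total)] for _ in range(n_total)]

# P: block-diagonal projector on Span(E_1, ..., E_d); block k is filled
# with 1/N_k. labels[i] is the block index of coordinate i.
labels = [k for k, size in enumerate(block_sizes) for _ in range(size)]
p = [[Fraction(1, block_sizes[labels[i]]) if labels[i] == labels[j]
      else Fraction(0)
      for j in range(n_total)] for i in range(n_total)]

between = subtract(p, p0)       # equals P(I - P0) since P P0 = P0
within = subtract(identity, p)  # equals (I - P)(I - P0) since P P0 = P0

checks = {
    "P P0 = P0": matmul(p, p0) == p0,
    "P0 P = P0": matmul(p0, p) == p0,
    "(P - P0)^2 = P - P0": matmul(between, between) == between,
    "P - P0 symmetric": transpose(between) == between,
    "rank(P - P0) = d - 1": trace(between) == d - 1,
    "rank(I - P) = N - d": trace(within) == n_total - d,
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

The ranks are read from the traces because the trace of an orthogonal projector equals its rank.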