TD4: Homogeneity/Dependency

Exercise 1

A sociologist wants to investigate whether the choice of transportation mode (Car, Bicycle, or Public Transit) varies among residents of three different cities: City A, City B, and City C.

The sociologist conducted a survey, and the responses are summarized in the contingency table below:

Transportation Mode City A City B City C Total
Car 120 150 100 370
Bicycle 80 60 90 230
Public Transit 100 90 110 300
Total 300 300 300 900
  1. Formulate the Hypothesis Testing Problem corresponding to the initial objective of the sociologist. Introduce the notation
  2. Answer to the initial question using a chi-squared test at a \(0.05\) significance level

Exercise 2

The statistician of an insurance company is tasked with studying the impact of an advertising campaign conducted in 7 regions where the company operates. To do this, he has extracted from the database the number of new clients acquired by a certain number of agents in each region.

Region 1 2 3 4 5 6 7
Number of agents 9 7 7 6 7 6 6
Average number of new clients 26.88 22.34 19.54 18.95 27.17 25.87 25.72
Variance of new clients 13.54 12.59 12.87 13.42 13.17 12.56 12.64

The statistician decides to perform an analysis of variance to test whether the regional factor influences the number of new clients. Let \(X_{ik}\) denote the number of new clients of agent \(i\) in region \(k\), \(N_k\) the number of agents in region \(k\), \(d = 7\) the number of regions and \(N_{\mathrm{tot}} = 48\) the total number of agents. Assume that the random variables \(X_{ik}\) are normal with mean \(\mu_k\) and variance \(\sigma^2\). Define:

\[ \left.\begin{array}{cl} \overline X_k &= \frac{1}{N_k} \sum_{i=1}^{N_k} X_{ik}\\ \overline{X} &= \frac{1}{N_{\mathrm{tot}}} \sum_{k=1}^d N_k\overline X_{k},&\end{array}\right. \left.\begin{array}{cl} V_k &= \frac{1}{N_k}\sum_{i=1}^{N_k} (X_{ik} - \overline X_k)^2 \\ V_W &= \frac{1}{N_{\mathrm{tot}}} \sum_{k=1}^d N_kV_k\\ V_B &= \frac{1}{N_{\mathrm{tot}}}\sum_{i=1}^{N_k} N_k(\overline X_k - \overline X)^2\\ V_{T} &= \frac{1}{N_{\mathrm{tot}}}\sum_{k=1}^d\sum_{i=1}^{N_k} (X_{ik} - \overline X)^2 \end{array} \right. \]

  1. Formulate the hypothesis testing problem to test whether the number of new clients is homogeneous accross the regions.
  2. What do \(\overline X_k\), \(\overline X\), \(V_k\), \(V_W\), \(V_B\), \(V_T\) represent?
  3. Prove the analysis of variance formula: \[V_T = V_W + V_B \; .\] substract and add \(\overline X_k\) in the definition of \(V_T\)
  4. Compute \(\overline X\), \(V_W\), \(V_B\), and \(V_T\).
  5. Write the definition of the ANOVA test statistic in terms of \(V_W\) and \(V_B\).
  6. Did the advertising campaign have the same impact in all regions?

Exercise 3

Some data are collected from 7 students and we want to analyze the correlation between the number of hours students spend studying before an exam and their test scores.

Student 1 2 3 4 5 6 7
Study Hours 2.5 3.0 1.5 4.0 3.5 5.0 3.0
Test Score 56 64 45 72 68 80 59
  1. Formulate the hypothesis testing problem for a linear correlation test
  2. Perform the linear correlation test at level \(0.05\)

Exercise 4

Below are stress scores for \(10\) patients before and after a sport session:

Participant Stress Score (Before) Stress Score (After) Difference Rank/Sign
1 40 32
2 38 35
3 45 40
4 50 42.5
5 44 41.5
6 48 48
7 39 30
8 42 38
9 47 46
10 46.5 40

We want to test if sport has an effect on the stress of the patients

  1. Formulate the hypothesis testing problem
  2. Complete the above table
  3. Perform a Wilcoxon signed rank test

Exercise 5

Let \(X=(X_1, \dots, X_N)\) be a Gaussian vectors \(\mathcal N(0, I_N)\) in \(\mathbb R^N\) (i.e. \(X_i\) are iid \(\mathcal N(0,1)\)).

    1. What is the distribution of \(QX\), if \(Q\) is an orthogonal matrix ? (\(QQ^T = I_n\))
    2. What is the distribution of \(\|PX\|^2\) if \(P\) is an orthogonal projector ?
      Use the rank of \(P\) defined as \(rk(P) = dim(Im(P))\)
      Definition of orthogonal projector: (\(P^2=P\) and \(P = P^T\))
    3. Show that if \(P\) is an orthogonal projector, then \(PX\) is independent of \((I-P)X\).
      Use the fact that two centered gaussian vectors \(X\),\(Y\) are independent iif \(\mathbb E[X_iY_j] = 0\) for all \(i,j\). Translate this fact in a matrix form.
    4. What is the distribution of \(\frac{n-rk(P)}{rk(P)}\frac{\|PX\|^2}{\|(I-P)X\|^2}\) ?
    5. Show that if \(P\), \(P_0\) are two orthogonal projectors such that \(Im(P_0) \subset Im(P)\), then \(P(I-P_0)X\) is independent of \((I-P)(I-P_0)X\). What is the distribution of \(\|P(I-P_0)X\|^2\) ?
      Show first that \(PP_0=P_0P= P_0\), and that \(P-P_0\) is an orthogonal projector
    6. What is the orthogonal projector \(P_0\) on \(\mathrm{Span}(1, \dots, 1)\) ? Deduce that \((X_i - \overline X)\) is independent of \(\overline X\) for all \(i\).
  1. We divide \(N\) into \(d\) blocks: \(N = N_1 + \dots + N_d\). We write \((X_1, \dots, X_n) = ((X_{11}, \dots, X_{N_11}), (X_{12}, \dots X_{N_2 2}), \dots, (X_{N_d 1}, \dots, X_{N_d d}))\).
    1. What is the orthogonal projection on \(E_k =((0, \dots, 0), \dots, (0, \dots, 0),(1, \dots, 1), (0,\dots, 0) ,\dots, (0, \dots, 0))\) ? (\(0\) everywhere except on block \(k\) where we put ones everywhere)
    2. Give the orthogonal projection \(P\) on \(\mathrm{Span}(E_1, \dots, E_d)\). Explicit \((I-P)(I-P_0)X\) and \(P(I-P_0)X\).
    3. Deduce the distribution of \(\frac{d-1}{n-d}\frac{\|(I-P)(I-P_0)X\|^2}{\|P(I-P_0)X\|^2}\) and of the ANOVA Test Statistic under \(H_0\).