Correlation, Homogeneity and Dependency

Correlation Test

We observe iid paired data \((X_1, Y_1), \dots, (X_n,Y_n)\) of unknown mean \(\mu_X, \mu_Y\) and cov matrix \(\Sigma\).
Cov matrix: \(\Sigma = \left(\begin{matrix} \sigma_X^2 & \mathrm{Cov(X,Y)} \\ \mathrm{Cov(X,Y)} & \sigma_Y^2 \\ \end{matrix}\right)\)
\(H_0: \mathrm{Cov}(X,Y)=0\) or \(H_1: \mathrm{Cov}(X,Y)\neq 0\)
\(\mathrm{Cov(X,Y)} = \mathbb E[(X- \mathbb E[X])(Y- \mathbb E[Y])]\)
\(\sigma_X^2 = \mathrm{Cov(X,X)}\)
\(\sigma_Y^2 = \mathrm{Cov(Y,Y)}\)
\(\mathrm{Cor(X,Y)} = \frac{\mathrm{Cov(X,Y)}}{\sigma_X \sigma_Y}\)

Pearson’s Correlation Test

\(r = \frac{\sum_{i=1}^n (X_i - \overline X)(Y_i - \overline Y)}{\sqrt{\sum_{i=1}^n (X_i - \overline X)^2\sum_{i=1}^n (Y_i - \overline Y)^2}}\)
\(\psi(X,Y) = \frac{r}{\sqrt{1-r^2}}\sqrt{n-2}\)
Under \(H_0\), \(\psi(X,Y) \approx \mathcal T(n-2)\)

Correlation Test

Monte Carlo Simulation with \(n=4\):

ANOVA Test

\(d\) independent groups (bags), each containing \(N_1, \dots, N_d\) individuals. \(N_{\mathrm{tot}} = \sum_{k=1}^d N_k\)
Group \(k\): \((X_{1,k}, \dots, X_{N_k,k}) \in \mathbb R^{N_k}\) are iid \(\mathcal N(\mu_k, \sigma)\). Ex: \(X_{ik} =\) sallary of \(i\) living in region \(k\)
\(H_0: \mu_1 = \dots = \mu_d\) vs \(H_1: \mu_l \neq \mu_k\) for some \((l,k)\) (unknown \(\sigma\))

ANOVA

Empirical mean of group \(k\): \(\overline X_k = \tfrac{1}{N_k}\sum_{i=1}^{N_k} X_{ik}\)
Sum of Squares in group \(k\):
\(SS_k = \sum_{i=1}^{N_k} (X_{ik} - \overline X_k)^2 \sim \sigma^2\chi^2(N_k-1)\) (under \(H_0\))
Var of group \(k\): \(V_k = \tfrac{1}{N_k}SS_k\)
Total empirical mean: \(\overline X = \tfrac{1}{N_{\mathrm{tot}}}\sum_{k=1}^d\sum_{i=1}^{N_k} X_{ik}\)
Sum of Squares btween Groups: \(SSB = \sum_{k=1}^{d} N_k(\overline X_k - \overline X)^2\sim\sigma^2\chi^2(d-1)\) (under \(H_0\))

ANOVA Test Statistic

\(\psi(X) = \frac{\tfrac{1}{N_{\mathrm{tot}} - d}\sum_{k=1}^d SS_k}{\tfrac{1}{d-1}SSB}\)
\(\psi(X) \sim \mathcal F(d-1, N_{\mathrm{tot}}-d)\) (Fisher under \(H_0\))
right-tailed test: \(p_{value} = 1-\mathrm{cdf}(\mathcal F(d-1, N_{\mathrm{tot}}-d)), \psi(x_{obs})\).

Interpretation of variances in ANOVA

\[ \left.\begin{array}{cl} \overline X_k &= \frac{1}{N_k} \sum_{i=1}^{N_k} X_{ik}\\ \overline{X} &= \frac{1}{N_{\mathrm{tot}}} \sum_{k=1}^d N_k\overline X_{k},&\end{array}\right. \left.\begin{array}{cl} V_k &= \frac{1}{N_k}\sum_{i=1}^{N_k} (X_{ik} - \overline X_k)^2 \\ V_W &= \frac{1}{N_{\mathrm{tot}}} \sum_{k=1}^d N_kV_k\\ V_B &= \frac{1}{N_{\mathrm{tot}}}\sum_{i=1}^{N_k} N_k(\overline X_k - \overline X)^2\\ V_{T} &= \frac{1}{N_{\mathrm{tot}}}\sum_{k=1}^d\sum_{i=1}^{N_k} (X_{ik} - \overline X)^2 \end{array} \right. \]

\(V_k\): Empirical variance of group \(k\)
\(V_W\): Average empirical variance within groups (unexplained variance)
\(V_B\): Empirical variance between groups (explained variance)
\(V_T\): Total variance of the sample

\[\psi(X) = \frac{\tfrac{1}{d-1}SSB}{\tfrac{1}{N_{\mathrm{tot}} - d}\sum_{k=1}^d SS_k}=\frac{\tfrac{1}{d-1}V_B}{\tfrac{1}{N_{tot}-d}V_W} = \frac{N_{tot}-d}{d-1} \frac{V_B}{V_W} \sim \mathcal F(d-1, N_{tot}-d) \; .\]

\(\chi^2\) Homogeneity and Independence Tests

\(\chi^2\) Homogeneity Test

\(d\) different bags (or groups), each containing balls of \(m\) potential colors.
If \(d = 3\) and \(m=2\), we observe the following \(2\times 3\) matrix of counts:

	bag 1	bag 2	bag 3	Total
color 1	\(X_{11}\)	\(X_{12}\)	\(X_{13}\)	\(R_1\)
color 2	\(X_{21}\)	\(X_{22}\)	\(X_{23}\)	\(R_2\)
Total	\(N_1\)	\(N_2\)	\(N_3\)	\(N\)

Bag \(j\): \((X_{1j}, X_{2j}) \sim \mathrm{Mult}(N_j, (p_{1j}, p_{2j}))\)
The parameters \(p_{ij}\) are unknown
\(H_0\): \(p_{i1} = p_{i2}=p_{i3}\) for all color \(i\) (bags are homogeneous)
\(H_1\): bags are heterogeneous
\(\sum_{i=1}^m\sum_{j=1}^d \frac{(X_{ij}- N_jp_{ij})^2}{N_jp_{ij}}\) not a test statistic

Chi-Squared Homogeneity Test Statistic

\(\hat p_{i} = \tfrac{1}{N}\sum_{j=1}^{d}X_{ij} = \frac{R_i}{N}\)
\(\psi(X) = \sum_{i=1}^m\sum_{j=1}^d \frac{(X_{ij}- N_j\hat p_{i})^2}{N_j\hat p_{i}}\)
Approximation: \(\psi(X) \sim \chi^2((m-1)(d-1))\)

Example: Soft drink preferences

Split population into \(3\) categories: Young Adults (18-30), Middle-Aged Adults (31-50), and Seniors (51 and above).
\(H_0\): The groups are homogeneous in terms of soft drink preferences

Age Group	Young Adults	Middle-Aged	Seniors	Total
Coke	60	40	30	130
Pepsi	50	55	25	130
Sprite	30	45	55	130
Total	140	140	110	390

\(N_1 \hat p_1 = 140*\frac{130}{390} \approx 46.7\)

\[ \begin{aligned} \psi(X) &= \frac{(60-46.7)^2}{46.7}&+ \frac{(40-46.7)^2}{46.7}&+\frac{(30-36.7)^2}{36.7} \\ &+\frac{(50-46.7)^2}{46.7}&+ \frac{(55-46.7)^2}{46.7}&+\frac{(25-36.7)^2}{36.7}\\ &+\frac{(30-46.7)^2}{46.7}&+ \frac{(45-46.7)^2}{46.7}&+\frac{(55-36.7)^2}{36.7}\\ &\approx 26.57 \end{aligned} \]

1-cdf(Chisq(4), 26.57) # 2.4e-5, reject H_0

\(\chi^2\) Independence Test

Observations: Paired Categorical Variables \((X_1, Y_1), \dots, (X_n, Y_n)\)
e.g. \(X_i \in \{\mathrm{male}, \mathrm{female}\}\), \(Y_i \in \{\mathrm{coffee}, \mathrm{tea}\}\)
Idea: regroup the data into bags, e.g. \(\{\mathrm{male}, \mathrm{female}\}\)
Build the contingency table
Perform a chi-square homogeneity test

Example of contingency table:

Gender	Male	Female	Total
Coffee	30	20	50
Tea	28	22	50
Total	58	42	100

Expected counts:

Gender	Male	Female	Total
Coffee	29	21	50
Tea	29	21	50
Total	58	42	100

Degree of freedom: \((2-1)(2-1) = 1\)

Wilcoxon’s Signed Rank

Symetric Random Variable

A median of \(X\) is a \(0.5\)-quantile of its distribution
If \(X\) has density \(p\), the median \(m\) is such that \[ \int_{-\infty}^m p(x)dx = \int_{m}^{+\infty} p(x)dx = 0.5 \; . \]
A symmetric random variable \(X\) is such that the distribution of \(X\) is the same as the distribution of \(-X\).
In particular, its median is \(0\).
If \(X\) is a symmetric random variable, then it has the same distribution as \(\varepsilon |X|\) where \(\varepsilon\) is independent of \(X\) and uniformly distributed in \(\{-1,1\}\)

Symetrization

If \(X\) and \(Y\) are two independent variables with same density \(p\), then \(X-Y\) is symetric.
Indeed: \[ \begin{aligned} \mathbb P(X - Y \leq t) &=\int_{-\infty}^t \mathbb P(X \leq t+y)p(y)dy \\ &=\int_{-\infty}^t \mathbb P(Y \leq t+x)p(x)dy \\ &= \mathbb P(Y-X \leq t) \; . \end{aligned} \]
\(X\) indep of \(Y\) \(\implies\) \(X-Y\) symmetric \(\implies\) \(\mathrm{median}(X-Y) = 0\)

Dependency Problem for Paired Data

We observe iid pairs of real numbers \((X_1, Y_1), \dots, (X_n, Y_n)\). The density of each pair \((X_i, Y_i)\) is unknown \(p_{XY}(x,y)\).
The marginal distribution of \(X_i\) and \(Y_i\) are, respectively, \[p_X(x) = \int_{y \in \mathbb R} p_{XY}(x,y)~~~~ \text{ and }~~~~ p_{Y}(y) = \int_{x \in \mathbb R} p_{XY}(x,y) \; .\]
\(H_0:\) The median of \((X_i - Y_i)\) is \(0\) for all \(i\)
\(H_1:\) The median of \((X_i - Y_i)\) is not \(0\) for some \(i\)

Generality of \(H_0\)

If \(X_i-Y_i\) is symmetric \(i\), then we are under \(H_0\).
If \(X_i\) is independent of \(Y_i\) for all \(i\), then we are under \(H_0\).

Warning

The pairs are assume to be independent, but within each pair, \(Y_i\) can depend on \(X_i\) (that is, we don’t necessarily have \(p(x,y) =p_X(x)p_Y(y)\)).

Wilcoxon’s Signed Rank Test

\(D_i = X_i - Y_i\)
We define the sign of pair \(i\) as the sign of \(D_i=X_i - Y_i\). It is in \(\{-1, 1\}\).
We define the rank of pair \(i\) as the permutation \(R_i\) that satisfies \(|D_{R_i}| = |D_{(i)}|\), where \[ |D_{(1)}| \leq \dots \leq |D_{(n)}| \]

Properties on the Signed Ranks

Under \(H_0\),

Signs of the \((D_i)\)’s are independent and uniformly distributed in \(\{-1, 1\}\).
In particular, the number of signs equal to \(+1\) follows a Binomial distribution \(\mathcal B(n,0.5)\).
The ranks \(R_i\) of the \((|D_i|)\)’s does not depend on the unknown density \(p_{X-Y}\). \((R_1, \dots, R_n)\) is a random permutation under \(H_0\).
Hence, any deterministic function of the ranks and of the differences is a pivotal test statistic: it does not depend on the distribution of the data under \(H_0\).

Wilcoxon’s Signed Rank Test

Properties on the Signed Ranks

Wilcoxon’s test statistic: \(W_- = \sum_{i=1}^n R_i \mathbf 1\{D_i < 0\}\)
Sometimes, also \(W_+ = \sum_{i=1}^n R_i \mathbf 1\{D_i > 0\}\) or \(\min(W_-, W_+)\).
Gaussian approximation: \(W_- \asymp n(n+1)/4 + \sqrt{n(n+1)(2n+1)/24} \mathcal N(0,1)\)
if \(H_1\): \(\mathrm{median}(D_i) > 0\): left-one sided test on \(W_-\).

This approximation fits well the exact distribution. Monte-Carlo simulation:

To generate a \(W_-\) under \(H_0\) in Julia:

k = rand(Binomial(n, 0.5))
w = sum(randperm(n)[1:k])

Effect of Drug on Blood Pressure

\(H_0\): the drug has no effect. \(H_1\): it lowers the blood pressure

Patient	\(X_i\) (Before)	\(Y_i\) (After)	\(D_i = X_i-Y_i\)	\(R_i\)
1	150	140	10	6 (+)
2	135	130	5	5 (+)
3	160	162	-2	2 (-)
4	145	146	-1	1 (-)
5	154	150	4	4 (+)
6	171	160	11	7 (+)
7	141	138	3	3 (+)

\(W_- = 1+2 = 3\).
From a simulation, we approx \(\mathbb P(W_-=i)\), for \(i \in \{0, 1, 2, 3, 4, 5,6\}\) under \(H_0\) by

[0.00784066, 0.00781442, 0.00781534, 0.01563892, 0.01562184, 0.02343478]

From a simulation, \(p_{value}=\mathbb P(W_- \leq 3) \approx 0.039 < 0.05\)