TP: Hypothesis Testing
0. Student Test
We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2)\). We assume that the vectors \(X\) and \(Y\) are independent. We want to test \(H_0\): \(\mu_1 = \mu_2\) VS \(H_1\): \(\mu_1 \neq \mu_2\). We observe the data
```julia
X = [-0.2657064426519085, -0.27538323622274347, 0.11419811877193782,
     0.1158736466676504, 1.7071154417851981, 0.9306910454777643,
     0.5834941669559498, -1.536447927372139, -1.4158768806157345,
     1.0532694288697444, 1.2955133629200777, -0.4195557179577367]
Y = [-0.6452416530469819, 0.3662048411679129, -0.09943069837472361,
     0.8738423322164134, 0.7163913715056272, -0.32450102319617485,
     0.9159821874321818, -2.3583609849887224]
```
- Compute the Student–Welch test statistic \(\tfrac{\overline X - \overline Y}{\sqrt{\hat\sigma_1^2/n_1 + \hat\sigma_2^2/n_2}}\), where \(\hat\sigma_1^2\) and \(\hat\sigma_2^2\) are the sample variances
- Conclude using a Gaussian approximation (use the cdf of a N(0,1))
- Conclude using `UnequalVarianceTTest` (Julia), `test.welch` (R), or `scipy.stats.ttest_ind(a, b, equal_var=False)` (Python)
- Bonus. Conclude using a better chi-squared approximation. Compare these results to question 3.
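As a worked sketch (Python with scipy, one of the options listed above; the other two languages proceed the same way), the statistic can be computed by hand and checked against the library's Welch test:

```python
import numpy as np
from scipy import stats

# Data from the exercise above
X = np.array([-0.2657064426519085, -0.27538323622274347, 0.11419811877193782,
              0.1158736466676504, 1.7071154417851981, 0.9306910454777643,
              0.5834941669559498, -1.536447927372139, -1.4158768806157345,
              1.0532694288697444, 1.2955133629200777, -0.4195557179577367])
Y = np.array([-0.6452416530469819, 0.3662048411679129, -0.09943069837472361,
              0.8738423322164134, 0.7163913715056272, -0.32450102319617485,
              0.9159821874321818, -2.3583609849887224])

n1, n2 = len(X), len(Y)
# Welch statistic with unbiased variance estimates (ddof=1)
t = (X.mean() - Y.mean()) / np.sqrt(X.var(ddof=1) / n1 + Y.var(ddof=1) / n2)
# Gaussian approximation: two-sided p-value from the N(0,1) cdf
p_gauss = 2 * stats.norm.cdf(-abs(t))
# Library version: Welch's t-test (unequal variances)
res = stats.ttest_ind(X, Y, equal_var=False)
print(t, p_gauss, res.statistic, res.pvalue)
```

The hand-computed statistic should coincide with the library's; the p-values differ slightly because the library uses a Student distribution with Welch–Satterthwaite degrees of freedom rather than the Gaussian approximation.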
1. Monte Carlo and Chi-squared Tests
A statistician observes \(X = (X_1, \dots, X_n)\) where the \(X_i\) are iid with distribution \(P\). To test whether \(P\) is Gaussian with known \(\mu\) and \(\sigma\), the problem is:
\[H_0: P=\mathcal N(\mu, \sigma) \quad \text{VS} \quad H_1: P\neq \mathcal N(\mu, \sigma)\]
If \(\mu\) and \(\sigma\) are unknown, the problem is \[H_0: P\in \{\mathcal N(\mu, \sigma), \mu \in \mathbb R, \sigma >0\}\quad \text{VS} \quad H_1: P\not \in \{\mathcal N(\mu, \sigma), \mu \in \mathbb R, \sigma >0\}\]
We first assume that \(\mu\) and \(\sigma\) are known, and that:

```julia
mu = 0
sigma = 1
n = 100
m = 5
```
This practical exercise aims to empirically demonstrate how a chi-squared test statistic converges to a chi-squared distribution in both known and unknown parameter scenarios. We will:
- Divide the observation space into disjoint intervals
- Count how many observations fall into each interval for randomly generated data
- Calculate the chi-squared test statistic for randomly generated data
- Repeat this process 1,000 times to build an empirical distribution (a histogram)
- The resulting empirical histogram should approach a theoretical chi-squared distribution as both the sample size \(n\) and the number of repetitions \(N\) approach infinity.
Questions
- Generate a vector \(X\) made of \(n\) iid \(\mathcal N(\mu, \sigma)\)
- Compute the vector \(Y = \frac{X-\mu}{\sigma}\)
- Compute the list of counts \(C\) of \(Y\) in \((-\infty, -3)\), \([\tfrac{3i}{m}, \frac{3(i+1)}{m})\) for \(i\) in \(\{-m, \dots, m-1\}\) and \([3,+\infty)\).
- How many intervals do we have here?
- What is the expected number of entries of \(Y\) falling in \([3, +\infty)\)? (compute this using the cdf function). Change the value of \(n\) so that we have at least \(5\) expected counts in \([3, +\infty)\).
```julia
# Julia: use the broadcast comparison .<
sum(x .<= Y .< y)   # counts in [x, y)
```

```r
# R: use the vectorized logical operator &
sum(Y >= x & Y < y) # counts in [x, y)
```
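As a quick numeric check for the expected-count question above (a sketch in Python, scipy assumed), the mass of \([3, +\infty)\) under \(\mathcal N(0,1)\) pins down the minimal \(n\):

```python
from scipy.stats import norm

# Mass of [3, +inf) under N(0,1): 1 - Phi(3) ~ 0.00135, so with n = 100 the
# expected count in that interval is ~0.135; 5 expected counts need n >= ~3704
p_tail = 1 - norm.cdf(3)
n_min = 5 / p_tail
print(p_tail, n_min)
```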
- Using the cdf of \(\mathcal N(0,1)\), compute the list of expected counts in the same intervals
- Compute the chi-squared test statistic using the two preceding questions. We recall that \(\psi(Y) = \sum_{i} \tfrac{(c_i - e_i)^2}{e_i}\), where the sum runs over the intervals and \(c_i\) and \(e_i\) are the observed and expected counts.
- Summarize the preceding questions into a function `trial_chisq(X, mu, sigma, m)` that normalizes \(X\), computes the counts, the expected counts, and the chi-squared test statistic:

```julia
# function trial_chisq(X, mu, sigma, m)
#     n = length(X)
#     Y = (X .- mu) ./ sigma
#     # compute counts
#     # compute expcounts
#     # compute and return chisq
# end
```
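One possible translation of this scaffold (a Python/numpy sketch, not the expected Julia solution) could look like:

```python
import numpy as np
from scipy.stats import norm

def trial_chisq(X, mu, sigma, m):
    # Normalize X, count observations per interval, compare to expected counts
    n = len(X)
    Y = (np.asarray(X) - mu) / sigma
    # Edges -inf, -3, -3 + 3/m, ..., 3, +inf give the 2m + 2 intervals
    edges = np.concatenate(([-np.inf], 3 * np.arange(-m, m + 1) / m, [np.inf]))
    counts = np.array([np.sum((lo <= Y) & (Y < hi))
                       for lo, hi in zip(edges[:-1], edges[1:])])
    expcounts = n * np.diff(norm.cdf(edges))
    return np.sum((counts - expcounts) ** 2 / expcounts)

# Demo on one simulated sample
psi = trial_chisq(np.random.default_rng(0).normal(0, 1, 100), 0, 1, 5)
print(psi)
```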
- Using the previous question, write a function `monte_carlo_known(N, mu, sigma, n, m)` that computes \(N\) chi-squared test statistics on iid random samples \(X \sim \mathcal N(\mu, \sigma)^{\otimes n}\). It returns a list `trials` of length \(N\).

```julia
N = 1000

# function monte_carlo_known(N, mu, sigma, n, m)
#     trials = []                 # empty list
#     for i = 1:N
#         # generate X made of n iid N(mu, sigma)
#         # append trial_chisq(X, mu, sigma, m) to trials
#     end
#     return trials
# end
```
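A self-contained Python sketch of this Monte-Carlo loop (it re-includes the `trial_chisq` of the previous question so it runs on its own; the seed and generator are illustrative choices):

```python
import numpy as np
from scipy.stats import norm

def trial_chisq(X, mu, sigma, m):
    # Chi-squared statistic of the previous question
    Y = (np.asarray(X) - mu) / sigma
    edges = np.concatenate(([-np.inf], 3 * np.arange(-m, m + 1) / m, [np.inf]))
    counts = np.array([np.sum((lo <= Y) & (Y < hi))
                       for lo, hi in zip(edges[:-1], edges[1:])])
    expcounts = len(Y) * np.diff(norm.cdf(edges))
    return np.sum((counts - expcounts) ** 2 / expcounts)

def monte_carlo_known(N, mu, sigma, n, m, seed=0):
    # N independent chi-squared statistics on fresh N(mu, sigma) samples
    rng = np.random.default_rng(seed)
    return [trial_chisq(rng.normal(mu, sigma, n), mu, sigma, m)
            for _ in range(N)]

trials = monte_carlo_known(1000, 0, 1, 100, 5)
print(len(trials), np.mean(trials))
```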
- Plot a histogram of a list of `trials` using a builtin function. Normalize it in density (area = 1), and specify the bins `0:0.5:30`.
- What is a good distribution to approximate the histogram? Plot the distribution's density and check that it fits the histogram. Vary the parameters \(m\), \(n\), and \(N\).
Now, we assume that \(\mu\) and \(\sigma\) are unknown.

- Given \(X\), compute two estimators `hatmu` and `hatsigma` of `mu` and `sigma`
- Similarly to Q.7, write a function `monte_carlo_unknown(N, n, m)` that computes a Monte-Carlo simulation. \(\hat \mu\) and \(\hat \sigma\) must be recomputed for each trial \(i = 1, \dots, N\).
- Revisit questions 8 and 9 in the case where \(\mu\) and \(\sigma\) are unknown. How does this affect the distribution of the histogram?
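A minimal sketch of the two estimators (Python/numpy assumed; in `monte_carlo_unknown` they would be recomputed from each simulated sample before being passed to `trial_chisq` in place of \(\mu\) and \(\sigma\)):

```python
import numpy as np

# Illustrative sample; the exercise uses freshly simulated data per trial
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 100)

hatmu = np.mean(X)
hatsigma = np.std(X, ddof=1)   # unbiased-variance version; ddof=0 gives the MLE
print(hatmu, hatsigma)
```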
2. Application with Bitcoin
- Use your favorite AI to write the code to import the last \(500\) hourly close prices of Bitcoin in USDT from Binance. Plot the prices and compute the returns defined as \(R_t = \tfrac{P_t}{P_{t-1}}-1\), where \(P_t\) is the price at time \(t\) (in hours).
- We first test \(H_0\): the mean of the returns is zero VS \(H_1\): it is nonzero. Compute \(\hat \sigma\) as `std(R)` and the Student statistic \(\psi(R) = \sqrt{n}\tfrac{\overline R}{\hat \sigma}\). Compute the p-value using the `cdf` function of a Student(499) (or a Gaussian). Obtain the same result with a library function like `OneSampleTTest` in Julia, `t.test` in R, or `ttest_1samp` in Python.
- Plot a histogram of the returns, normalized in density. Plot on the same graph the density of a Gaussian of mean `mean(R)` and of std `std(R)`.
- Using the previous exercise with \(m = 5\), compute a chi-squared statistic and an approximated p-value.
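The one-sample test can be sketched as follows (Python with scipy; `R` here is a synthetic stand-in for the returns, since the real ones come from the Binance download):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for 500 hourly returns (hypothetical data)
rng = np.random.default_rng(0)
R = rng.normal(0.0, 0.005, 500)

n = len(R)
hatsigma = R.std(ddof=1)
psi = np.sqrt(n) * R.mean() / hatsigma            # Student statistic
p_manual = 2 * stats.t.cdf(-abs(psi), df=n - 1)   # two-sided, Student(499)
res = stats.ttest_1samp(R, popmean=0.0)           # library version
print(psi, p_manual, res.pvalue)
```

The hand-computed statistic and p-value should match `ttest_1samp` exactly, since both use the Student(\(n-1\)) distribution.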
- Do a scatter plot of \((R_{t-1}, R_t)\). Do you see any correlation between \(R_{t-1}\) and \(R_t\)?
- Compute the correlation \(r\) between \((R_t)\) and \((R_{t-1})\).
- Compute the p-value of a two-sided Pearson's correlation test, using the test statistic \(\tfrac{r}{\sqrt{1-r^2}} \sqrt{n-2}\) and the cdf of a Student distribution. Compare with the function `CorrelationTest` in Julia, `cor.test` in R, or `pearsonr` in Python.
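A sketch of the correlation test (Python; again on synthetic stand-in returns, not the actual Binance data):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the returns (hypothetical data)
rng = np.random.default_rng(1)
R = rng.normal(0.0, 0.005, 500)
x, y = R[:-1], R[1:]                     # (R_{t-1}, R_t) pairs

n = len(x)
r = np.corrcoef(x, y)[0, 1]
t = r / np.sqrt(1 - r**2) * np.sqrt(n - 2)
p_manual = 2 * stats.t.cdf(-abs(t), df=n - 2)   # two-sided, Student(n - 2)
r_lib, p_lib = stats.pearsonr(x, y)             # library version
print(r, p_manual, p_lib)
```

The correlation coefficients should coincide, and the Student-based p-value should agree with `pearsonr` up to numerical precision.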