TP: Hypothesis Testing
0. Student Test
We observe \((X_1, \dots, X_{n_1})\) iid \(\mathcal N(\mu_1, \sigma_1^2)\) and \((Y_1, \dots, Y_{n_2})\) iid \(\mathcal N(\mu_2, \sigma_2^2)\). We assume that the vectors \(X\) and \(Y\) are independent. We want to test \(H_0\): \(\mu_1 = \mu_2\) VS \(H_1\): \(\mu_1 \neq \mu_2\).
We observe the data:
X=[-0.2657064426519085, -0.27538323622274347, 0.11419811877193782, 0.1158736466676504, 1.7071154417851981, 0.9306910454777643, 0.5834941669559498, -1.536447927372139, -1.4158768806157345, 1.0532694288697444, 1.2955133629200777, -0.4195557179577367]
Y=[-0.6452416530469819, 0.3662048411679129, -0.09943069837472361, 0.8738423322164134, 0.7163913715056272, -0.32450102319617485, 0.9159821874321818, -2.3583609849887224]
- Compute the Welch test statistic \(\tfrac{\overline X - \overline Y}{\sqrt{\hat\sigma_1^2/n_1 + \hat\sigma_2^2/n_2}}\).
- Conclude using a Gaussian approximation (use the cdf of a \(\mathcal N(0,1)\)).
- Conclude using `UnequalVarianceTTest` (Julia), `t.test(x, y, var.equal = FALSE)` (R), or `scipy.stats.ttest_ind(a, b, equal_var=False)` (Python).
- Bonus. Conclude using a better chi-squared (Welch–Satterthwaite) approximation. Compare with the results of the previous questions.
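As a sanity check, the Welch statistic can be computed by hand in Python and compared with the library result. This is a sketch assuming numpy and scipy are available; the unbiased variance estimator (`ddof=1`) is used, which is also what `scipy.stats.ttest_ind` uses internally.

```python
import numpy as np
from scipy import stats

# Data from the exercise
X = np.array([-0.2657064426519085, -0.27538323622274347, 0.11419811877193782,
              0.1158736466676504, 1.7071154417851981, 0.9306910454777643,
              0.5834941669559498, -1.536447927372139, -1.4158768806157345,
              1.0532694288697444, 1.2955133629200777, -0.4195557179577367])
Y = np.array([-0.6452416530469819, 0.3662048411679129, -0.09943069837472361,
              0.8738423322164134, 0.7163913715056272, -0.32450102319617485,
              0.9159821874321818, -2.3583609849887224])

n1, n2 = len(X), len(Y)
s1, s2 = X.var(ddof=1), Y.var(ddof=1)                    # unbiased variance estimators
T = (X.mean() - Y.mean()) / np.sqrt(s1 / n1 + s2 / n2)   # Welch statistic
p_gauss = 2 * stats.norm.sf(abs(T))                      # two-sided Gaussian approximation
res = stats.ttest_ind(X, Y, equal_var=False)             # Welch's t-test (Satterthwaite df)
```

The manual statistic `T` should coincide with `res.statistic`; only the p-values differ, since the library uses a Student distribution with Welch–Satterthwaite degrees of freedom instead of the Gaussian.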
1. Monte Carlo and Chi-squared Tests
A statistician observes \(X = (X_1, \dots, X_n)\) where the \(X_i\)’s are iid with distribution \(P\). To test whether \(P\) is Gaussian with known \(\mu\) and \(\sigma\), the problem is:
\[H_0: P=\mathcal N(\mu, \sigma^2) \quad \text{VS} \quad H_1: P\neq \mathcal N(\mu, \sigma^2)\]
If \(\mu\) and \(\sigma\) are unknown, the problem is: \[H_0: P\in \{\mathcal N(\mu, \sigma^2), \mu \in \mathbb R, \sigma >0\}\quad \text{VS} \quad H_1: P\not \in \{\mathcal N(\mu, \sigma^2), \mu \in \mathbb R, \sigma >0\}\]
We first assume that \(\mu\) and \(\sigma\) are known, and that:
```
mu = 0
sigma = 1
n = 100
m = 5
```

This practical exercise aims to empirically demonstrate how a chi-squared test statistic converges to a chi-squared distribution in both the known- and unknown-parameter scenarios. We will:
- Divide the observation space into \(2m+2\) disjoint intervals.
- Count how many observations fall into each interval for randomly generated data.
- Calculate the chi-squared test statistic for randomly generated data.
- Repeat this process \(1000\) times to build an empirical distribution (a histogram).
- The resulting empirical histogram should approach a theoretical chi-squared distribution as both the sample size \(n\) and the number of repetitions \(N\) approach infinity.
Questions
- Generate a vector \(X\) of \(n\) iid \(\mathcal N(\mu, \sigma^2)\) samples.
- Compute the vector \(Z = \frac{X-\mu}{\sigma}\).
- Compute the list of counts \(C\) of \(Z\) in \((-\infty, -3)\), \([\tfrac{3i}{m}, \tfrac{3(i+1)}{m})\) for \(i \in \{-m, \dots, m-1\}\), and \([3,+\infty)\).
- How many intervals are there in total?
- What is the expected number of entries of \(Z\) falling in \([3, +\infty)\)? (Compute this using the cdf function.) The expected count should be at least \(5\); what value of \(n\) ensures this?
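The tail question reduces to computing \(n\,(1-\Phi(3))\) with the standard normal cdf. A quick sketch (assuming scipy; the variable names are illustrative):

```python
import math
from scipy.stats import norm

n = 100
p_tail = norm.sf(3.0)           # P(Z >= 3) = 1 - Phi(3), about 0.00135
expected = n * p_tail           # expected count in [3, +inf) for n = 100
n_min = math.ceil(5 / p_tail)   # smallest n giving an expected count of at least 5
```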
```julia
# Julia: use broadcasting with .
sum(x .<= Z .< y)   # counts in [x, y)
```

```r
# R: use the vectorized logical operator &
sum(Z >= x & Z < y) # counts in [x, y)
```

- Using the cdf of \(\mathcal N(0,1)\), compute the list of expected counts in the same intervals.
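In Python, the counts and expected counts for all \(2m+2\) intervals can be obtained in one shot with `numpy.histogram`, since it uses half-open bins \([a, b)\). A sketch (the helper name `interval_counts` is illustrative):

```python
import numpy as np
from scipy.stats import norm

def interval_counts(Z, m):
    """Counts and expected counts of Z over the 2m+2 intervals
    (-inf, -3), [3i/m, 3(i+1)/m) for i = -m..m-1, and [3, +inf)."""
    edges = np.concatenate(([-np.inf], 3.0 * np.arange(-m, m + 1) / m, [np.inf]))
    counts, _ = np.histogram(Z, bins=edges)
    expected = len(Z) * np.diff(norm.cdf(edges))
    return counts, expected

rng = np.random.default_rng(0)
C, E = interval_counts(rng.standard_normal(100), 5)
```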
- Compute the chi-squared test statistic using the two preceding questions. Recall that \(\psi(Z) = \sum_{j} \tfrac{(c_j - e_j)^2}{e_j}\) where \(c_j\) and \(e_j\) are the observed and expected counts for interval \(j\).
- Summarize the preceding questions into a function `trial_chisq(X, mu, sigma, m)` that normalizes \(X\), computes counts, expected counts, and the chi-squared test statistic:
```julia
# function trial_chisq(X, mu, sigma, m)
#     n = length(X)
#     Z = (X .- mu) ./ sigma
#     Compute counts
#     Compute expcounts
#     Compute and return chisq
```

- Using the previous question, write a function `monte_carlo_known` that computes \(N\) chi-squared test statistics on iid random samples \(X\sim \mathcal N(\mu, \sigma^2)^{\otimes n}\). It returns a list `trials` of length \(N\).
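A possible Python version of `trial_chisq`, following the pseudocode above (a sketch assuming numpy and scipy):

```python
import numpy as np
from scipy.stats import norm

def trial_chisq(X, mu, sigma, m):
    # Normalize, count observations per interval, compare with expected counts
    Z = (np.asarray(X) - mu) / sigma
    edges = np.concatenate(([-np.inf], 3.0 * np.arange(-m, m + 1) / m, [np.inf]))
    counts, _ = np.histogram(Z, bins=edges)
    expcounts = len(Z) * np.diff(norm.cdf(edges))
    return float(np.sum((counts - expcounts) ** 2 / expcounts))

rng = np.random.default_rng(0)
stat = trial_chisq(rng.normal(0.0, 1.0, 100), 0.0, 1.0, 5)
```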
```julia
N = 1000
# function monte_carlo_known(N, mu, sigma, n, m)
#     empty list trials
#     for i = 1 ... N
#         Generate X of n iid Gaussian(mu, sigma)
#         append trial_chisq(X, mu, sigma, m) to trials
#     endfor
#     return trials
```

- Plot a histogram of `trials` using a built-in function. Normalize it as a density (area \(= 1\)), and use bins `0:0.5:30`.
- What is a good theoretical distribution to approximate this histogram? Plot the distribution’s density and check that it fits. Vary the parameters \(m\), \(n\), and \(N\).
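The Monte Carlo loop can be sketched as below. With \(2m+2\) classes and known parameters, the natural candidate for the histogram is the chi-squared distribution with \(2m+1\) degrees of freedom, whose density can be overlaid on the normalized histogram (plotting code omitted; assumes numpy and scipy):

```python
import numpy as np
from scipy.stats import norm, chi2

def trial_chisq(X, mu, sigma, m):
    Z = (np.asarray(X) - mu) / sigma
    edges = np.concatenate(([-np.inf], 3.0 * np.arange(-m, m + 1) / m, [np.inf]))
    counts, _ = np.histogram(Z, bins=edges)
    expcounts = len(Z) * np.diff(norm.cdf(edges))
    return float(np.sum((counts - expcounts) ** 2 / expcounts))

def monte_carlo_known(N, mu, sigma, n, m, seed=0):
    rng = np.random.default_rng(seed)
    return np.array([trial_chisq(rng.normal(mu, sigma, n), mu, sigma, m)
                     for _ in range(N)])

trials = monte_carlo_known(1000, 0.0, 1.0, 100, 5)
df = 2 * 5 + 1                       # number of classes minus one
grid = np.linspace(0, 30, 61)
density = chi2.pdf(grid, df)         # overlay on the normalized histogram of trials
```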
Now, we assume that \(\mu\) and \(\sigma\) are unknown.
- Given \(X\), compute estimators `hatmu` and `hatsigma` of \(\mu\) and \(\sigma\).
- Similarly to Q.7, write a function `monte_carlo_unknown(N, n, m)` that computes a Monte Carlo simulation. \(\hat\mu\) and \(\hat\sigma\) must be estimated separately for each trial \(i = 1, \dots, N\).
- Revisit questions 8 and 9 for the unknown-parameter case. How does estimating \(\mu\) and \(\sigma\) affect the distribution of the chi-squared statistic?
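A sketch of `monte_carlo_unknown`: the only change from the known-parameter case is that \(\hat\mu\) and \(\hat\sigma\) are re-estimated inside each trial. The sample mean and the unbiased standard deviation (`ddof=1`) are assumed here as estimators.

```python
import numpy as np
from scipy.stats import norm

def trial_chisq(X, mu, sigma, m):
    Z = (np.asarray(X) - mu) / sigma
    edges = np.concatenate(([-np.inf], 3.0 * np.arange(-m, m + 1) / m, [np.inf]))
    counts, _ = np.histogram(Z, bins=edges)
    expcounts = len(Z) * np.diff(norm.cdf(edges))
    return float(np.sum((counts - expcounts) ** 2 / expcounts))

def monte_carlo_unknown(N, n, m, seed=0):
    rng = np.random.default_rng(seed)
    trials = []
    for _ in range(N):
        X = rng.standard_normal(n)   # true N(0,1), unknown to the statistician
        hatmu = X.mean()
        hatsigma = X.std(ddof=1)     # assumption: unbiased standard deviation
        trials.append(trial_chisq(X, hatmu, hatsigma, m))
    return np.array(trials)

trials_u = monte_carlo_unknown(1000, 100, 5)
```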
2. Application with Bitcoin
- Use your favorite AI to write the code to import the last \(500\) hourly close prices of Bitcoin in USDT from Binance. Plot the prices and compute the returns defined as \(R_t = \tfrac{P_t}{P_{t-1}}-1\), where \(P_t\) is the price at time \(t\) (in hours).
- We first test \(H_0\): the mean of the returns is zero VS \(H_1\): it is nonzero. Compute \(\hat\sigma\) as `std(R)` and the Student statistic \(\psi(R) = \sqrt{n}\,\tfrac{\overline R}{\hat\sigma}\). Compute the p-value using the cdf of a Student\((n-1)\) distribution (or a Gaussian). Obtain the same result with a library function: `OneSampleTTest` in Julia, `t.test` in R, or `ttest_1samp` in Python.
- Plot a histogram of the returns, normalized as a density. On the same graph, plot the density of a Gaussian with mean `mean(R)` and standard deviation `std(R)`.
- Using the function from Section 1 with \(m=5\), compute a chi-squared statistic and an approximate p-value.
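The manual Student statistic and `scipy.stats.ttest_1samp` should agree exactly. Since the Binance download is not reproduced here, simulated returns stand in for \(R\) (purely illustrative placeholder data):

```python
import numpy as np
from scipy import stats

# Placeholder for the Bitcoin returns: simulated only to keep the example
# reproducible; in the exercise, R comes from the downloaded prices.
rng = np.random.default_rng(1)
R = rng.normal(0.0001, 0.01, size=500)

n = len(R)
hatsigma = R.std(ddof=1)
psi = np.sqrt(n) * R.mean() / hatsigma            # Student statistic
p_student = 2 * stats.t.sf(abs(psi), df=n - 1)    # two-sided p-value, Student(n-1)
res = stats.ttest_1samp(R, 0.0)                   # library version
```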
- Do a scatter plot of \((R_{t-1}, R_t)\). Do you see any correlation between \(R_{t-1}\) and \(R_t\)?
- Compute the correlation \(r\) between \((R_t)\) and \((R_{t-1})\).
- Compute the p-value of a two-sided Pearson correlation test, using the test statistic \(\tfrac{r}{\sqrt{1-r^2}}\sqrt{n-2}\) and the cdf of a Student distribution. Compare with `CorrelationTest` in Julia, `cor.test` in R, or `pearsonr` in Python.
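The manual correlation test can be checked against `scipy.stats.pearsonr`, whose two-sided p-value is equivalent to the Student\((n-2)\) formula above. Placeholder simulated returns are used again:

```python
import numpy as np
from scipy import stats

# Placeholder returns (in the exercise, R comes from the Bitcoin prices)
rng = np.random.default_rng(2)
R = rng.normal(0.0, 0.01, size=500)

x, y = R[:-1], R[1:]                              # pairs (R_{t-1}, R_t)
n = len(x)
r = np.corrcoef(x, y)[0, 1]                       # sample correlation
t_stat = r / np.sqrt(1 - r ** 2) * np.sqrt(n - 2)
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
r_lib, p_lib = stats.pearsonr(x, y)               # library version
```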