TD2-5
Exercise 1. R Lab on Eucalyptus Data
We want to explain the height of eucalyptus trees based on their circumference using simple linear regression. We have measurements of heights (ht) and circumferences (circ) of 1429 eucalyptus trees, which are found in the file “eucalyptus.txt”.
Extract the data and display them in a scatter plot.
Perform the regression \(y = \beta_1 + \beta_2 x + \varepsilon\) where \(y\) represents the height and \(x\) the circumference. Comment on the results.
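A minimal R sketch of these first steps (assuming “eucalyptus.txt” is whitespace-separated with a header and columns named ht and circ; adjust the read.table options to the actual file):

```r
# Load the data and display the scatter plot
euca <- read.table("eucalyptus.txt", header = TRUE)
plot(ht ~ circ, data = euca,
     xlab = "circumference (circ)", ylab = "height (ht)")

# Simple linear regression: ht = beta1 + beta2 * circ + eps
reg <- lm(ht ~ circ, data = euca)
summary(reg)   # estimates, standard errors, t-tests, R^2
```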
What are the theoretical formulas for \(\hat{\beta}_1\) and \(\hat{\beta}_2\)? Use them to recover the estimates provided by R.
Calculate a 95% confidence interval for \(\beta_1\) and \(\beta_2\), assuming the normality of the data.
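A possible sketch, reusing the model object reg from the previous sketch: confint gives the intervals directly, and the by-hand version makes the Student quantile explicit.

```r
# Direct computation of the 95% confidence intervals
confint(reg, level = 0.95)

# The same intervals by hand: estimate +/- t_{n-2, 0.975} * standard error
cf  <- summary(reg)$coefficients
est <- cf[, "Estimate"]
se  <- cf[, "Std. Error"]
tq  <- qt(0.975, df = reg$df.residual)
cbind(lower = est - tq * se, upper = est + tq * se)
```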
If the noise \(\varepsilon\) does not follow a normal distribution, do the previous confidence intervals remain valid?
Plot the estimated regression line and a 95% confidence band for it. What do you conclude about the quality of the estimation?
We now want to predict the height of a new series of eucalyptus trees with circumferences 50, 100, 150, and 200. Give the predicted height of each of them and the associated 95% prediction intervals, assuming the normality of the data.
Add the prediction intervals (associated with the same circumference values) to the graphical representation from question 6.
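A sketch of the corresponding plots, again assuming euca and reg from the sketches above (the grid size and graphical settings are arbitrary choices):

```r
# Fitted line with a 95% confidence band over a grid of circumferences
grid <- data.frame(circ = seq(min(euca$circ), max(euca$circ), length.out = 100))
band <- predict(reg, newdata = grid, interval = "confidence", level = 0.95)
plot(ht ~ circ, data = euca, pch = ".")
matlines(grid$circ, band, lty = c(1, 2, 2), col = c("red", "blue", "blue"))

# 95% prediction intervals for the four new trees
new  <- data.frame(circ = c(50, 100, 150, 200))
pred <- predict(reg, newdata = new, interval = "prediction", level = 0.95)
points(new$circ, pred[, "fit"], pch = 19)
arrows(new$circ, pred[, "lwr"], new$circ, pred[, "upr"],
       angle = 90, code = 3, length = 0.05)
```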
If the noise \(\varepsilon\) does not follow a normal distribution, do the previous prediction intervals remain valid?
Exercise 2. The Fisher Test and \(R^2\)
Consider a multiple linear regression model \(y = X\beta + \varepsilon\) where \(\beta \in \mathbb{R}^p\), \(X\) is a matrix of size \((n, p)\) and \(\varepsilon\) is a random vector of size \(n\), centered with covariance matrix \(\sigma^2 I_n\) (\(I_n\) is the identity matrix).
We want to test \(q\) linear constraints on the parameter \(\beta\), that is to test \(H_0: R\beta = 0\) against \(H_1: R\beta \neq 0\), where \(R\) is a matrix of size \((q, p)\).
We denote SSR the sum of squared residuals of the initial model, and \(\text{SSR}_c\) the sum of squared residuals of the constrained model (i.e., the model estimated under the constraint \(H_0\)).
Recall the statistic used to perform this test. Denote it \(F\) and give its expression as a function of SSR and \(\text{SSR}_c\).
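For reference, the classical nested-models statistic expected here has the form \[F = \frac{(\text{SSR}_c - \text{SSR})/q}{\text{SSR}/(n - p)}.\]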
What distribution does this statistic follow under \(H_0\) when \(\varepsilon\) follows a normal distribution? What can be said about its distribution if no normality assumption is made on \(\varepsilon\)?
Show that if a constant is present in the constrained model, \[F = \frac{R^2 - R^2_c}{1 - R^2} \cdot \frac{n - p}{q}\] where \(R^2\) (respectively \(R^2_c\)) denotes the coefficient of determination of the initial model (respectively of the constrained model).
Exercise 3. R Lab on Ice Cream Consumption
We study ice cream consumption in the United States over 30 four-week periods from March 18, 1951 to July 11, 1953. The variables are the period (from 1 to 30) and, on average for each period: ice cream consumption per person (“Consumption”, in half-liters), the price of ice cream (“Price”, in dollars), the average weekly household income (“Income”, in dollars), and the temperature (“Temp”, in degrees Fahrenheit). The data are available in the file “icecream-R.dat”.
Extract the data and plot consumption as a function of the different variables.
We propose to linearly regress consumption on the three variables “Price”, “Income” and “Temp”, assuming also that a constant is present in the model. Denote by \(\beta_1\) the constant and by \(\beta_2\), \(\beta_3\) and \(\beta_4\) the coefficients associated with the three variables, respectively. Carry out the estimation of this regression and comment on the sign of the estimated coefficients.
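A minimal sketch of the estimation step (assuming “icecream-R.dat” is whitespace-separated with a header and columns named Consumption, Price, Income and Temp):

```r
# Load the data and plot consumption against each variable
ice <- read.table("icecream-R.dat", header = TRUE)
pairs(ice)

# Multiple regression with a constant and the three explanatory variables
mod <- lm(Consumption ~ Price + Income + Temp, data = ice)
summary(mod)
```

The summary output already contains the individual Student's t-tests and the global Fisher statistic used in the next two questions.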
Test the global significance of the proposed model, i.e. \(H_0: \beta_2 = \beta_3 = \beta_4 = 0\), using the global Fisher test.
Test the significance of the variable “Price” in this model at the 5% level. Similarly test the significance of “Income”, then of “Temp”.
Compare the previous complete model and the model without the “Price” variable using a Fisher test, proceeding in four ways (see the sketch after this list):
- basing the calculation on the sum of squared residuals of each model;
- basing the calculation on the coefficient of determination of each model;
- using the linearHypothesis function from the car library;
- using the anova function.
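A sketch of the four approaches, reusing ice and mod from the previous sketch:

```r
# Constrained model: "Price" removed
mod0 <- lm(Consumption ~ Income + Temp, data = ice)

# 1) From the sums of squared residuals (q = 1 constraint, p = 4 parameters)
ssr  <- sum(resid(mod)^2)
ssr0 <- sum(resid(mod0)^2)
n <- nrow(ice); p <- 4; q <- 1
F1 <- ((ssr0 - ssr) / q) / (ssr / (n - p))
pf(F1, q, n - p, lower.tail = FALSE)   # p-value

# 2) From the coefficients of determination (same value as F1)
r2  <- summary(mod)$r.squared
r20 <- summary(mod0)$r.squared
(r2 - r20) / (1 - r2) * (n - p) / q

# 3) With linearHypothesis from the car library
library(car)
linearHypothesis(mod, "Price = 0")

# 4) With anova on the nested models
anova(mod0, mod)
```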
What is the difference between this test and the Student’s t-test of significance of the “Price” variable?
Compare the complete model and the model without the “Price” variable and without the constant using a Fisher test. Proceed according to the four methods described above. Comment.
We now want to predict ice cream consumption for the following data: \(\text{Price} = 0.3\), \(\text{Income} = 85\) and \(\text{Temp} = 65\). Propose the prediction that seems best to you given the models studied previously. Give a 95% prediction interval around this prediction.
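A short sketch, assuming the model retained is mod from above:

```r
# 95% prediction interval for the new observation
new <- data.frame(Price = 0.3, Income = 85, Temp = 65)
predict(mod, newdata = new, interval = "prediction", level = 0.95)
```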
Under what assumption is the previous prediction interval valid? Check it by examining the QQ-plot of the regression residuals and by performing a statistical test.
Check the remaining assumptions: recall the definition of the VIF (Variance Inflation Factor) and calculate it for each explanatory variable, then perform a graphical analysis of the residuals.
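A sketch of these checks (Shapiro-Wilk test for normality, vif from the car library for collinearity, and the default diagnostic plots):

```r
# Normality of the residuals: QQ-plot and Shapiro-Wilk test
qqnorm(resid(mod)); qqline(resid(mod))
shapiro.test(resid(mod))

# Collinearity: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R^2 of the
# regression of the j-th explanatory variable on the other ones
library(car)
vif(mod)

# Graphical analysis of the residuals
plot(mod)
```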
Observe the 3D scatter plot of the variables “Consumption”, “Income” and “Temp”, and the fit by the linear model, using scatter3d from the car library.
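A possible call (scatter3d also requires the rgl package):

```r
library(car)
scatter3d(Consumption ~ Income + Temp, data = ice)
```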
Exercise 4. The Multiple Correlation Coefficient with or without Constant
Consider the regression model \[y_i = \beta_0 + \beta_1 x_i^{(1)} + \cdots + \beta_p x_i^{(p)} + \varepsilon_i, \quad i = 1, \ldots, n, \quad (*)\]
where the variables \(\varepsilon_i\) are centered, with variance \(\sigma^2\) and uncorrelated. Let \(Y = (y_1, \ldots, y_n)^T\), \(X^{(k)} = (x_1^{(k)}, \ldots, x_n^{(k)})^T\) and \(\mathbf{1} = (1, \ldots, 1)^T\). We assume that the variables \(X^{(k)}\) are not linearly related to \(\mathbf{1}\). We denote \(\bar{y}\) the empirical mean of \(Y\) and \(\hat{Y} = \hat{\beta}_0 \mathbf{1} + \hat{\beta}_1 X^{(1)} + \cdots + \hat{\beta}_p X^{(p)}\) where the estimators are those obtained by ordinary least squares.
What does \(\hat{Y}\) represent geometrically? Draw a diagram showing the vectors \(Y\), \(\hat{Y}\), \(\bar{y}\mathbf{1}\), \(Y - \bar{y}\mathbf{1}\), \(\hat{Y} - \bar{y}\mathbf{1}\) and \(\hat{\varepsilon}\).
Deduce the following equalities:
- \(\sum_{i=1}^n y_i^2 = \sum_{i=1}^n \hat{\varepsilon}_i^2 + \sum_{i=1}^n \hat{y}_i^2\)
- \(\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n \hat{\varepsilon}_i^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\)
Consider the ratios: \[R_1^2 = \frac{\sum_{i=1}^n \hat{y}_i^2}{\sum_{i=1}^n y_i^2} \quad \quad R_2^2 = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}\]
Justify that \(R_1^2 \geq R_2^2\). In which case do we have equality?
What is the definition of the multiple correlation coefficient for model \((*)\)?
We now consider a regression model without a constant, i.e., we set \(\beta_0 = 0\) in \((*)\). Do the equalities shown in 2) remain valid? What is the definition of the multiple correlation coefficient in this case?
After estimating model \((*)\) with or without the constant, we obtain \(R^2 = 0.72\) with the constant and \(R^2 = 0.96\) without the constant. Is the introduction of the constant relevant?
Exercise 5. The Interpretation of \(R^2\) as Multiple Correlation Coefficient
We consider a regression model containing a constant. We define \[\rho(Y, X) = \sup_{\beta} \text{cor}(Y, X\beta)\] where cor denotes the empirical correlation. This quantity is thus the maximum correlation that can be achieved between \(Y\) and a linear combination of the explanatory variables.
Show that \[\text{cor}(Y, X\beta) = \frac{(X\hat{\beta} - \bar{Y}\mathbf{1})^T(X\beta - \overline{X\beta}\,\mathbf{1})}{\|Y - \bar{Y}\mathbf{1}\| \, \|X\beta - \overline{X\beta}\,\mathbf{1}\|}\] where \(\hat{\beta}\) is the OLS estimator of the regression of \(Y\) on \(X\), and \(\overline{X\beta}\) denotes the empirical mean of the components of the vector \(X\beta\).
Deduce that for all \(\beta\), \(\text{cor}(Y, X\beta)^2 \leq R^2\), where \(R^2\) is the multiple correlation coefficient of the regression of \(Y\) on \(X\).
Show that the previous bound is attained when \(\beta = \hat{\beta}\).
Conclude that \(\rho(Y, X)^2 = R^2\), justifying the terminology “multiple correlation coefficient”.
Exercise 6. Effect of Multicollinearity
Consider a model with two explanatory variables, assumed to be centered. From the estimation on \(n\) individuals, we obtained the following matrices \(X^T X\) and \(X^T Y\): \[X^T X = \begin{pmatrix} 200 & 150 \\ 150 & 113 \end{pmatrix} \quad X^T Y = \begin{pmatrix} 350 \\ 263 \end{pmatrix}\]
The removal of one observation has modified these matrices as follows: \[X^T X = \begin{pmatrix} 199 & 149 \\ 149 & 112 \end{pmatrix} \quad X^T Y = \begin{pmatrix} 347.5 \\ 261.5 \end{pmatrix}\]
- Calculate the estimated coefficients of the regression in both cases.
- Calculate the linear correlation coefficient between the two explanatory variables.
- Comment.
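These computations can be checked in R; a minimal sketch using the matrices given above:

```r
# Estimated coefficients: solve the normal equations X'X b = X'y
XtX  <- matrix(c(200, 150, 150, 113), nrow = 2)
Xty  <- c(350, 263)
solve(XtX, Xty)                 # full sample

XtX2 <- matrix(c(199, 149, 149, 112), nrow = 2)
Xty2 <- c(347.5, 261.5)
solve(XtX2, Xty2)               # after removing one observation

# Correlation between the two centered explanatory variables
150 / sqrt(200 * 113)           # full sample
149 / sqrt(199 * 112)           # after removal
```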
Exercise 7. Modeling the Daily Maximum Ozone Concentration
The data set “ozone.txt” contains the maximum ozone concentration (maxO3) measured each day of summer 2001 in Rennes. It also contains temperatures, cloudiness and wind speed measured at 9am, 12pm and 3pm (respectively T9, T12, T15, Ne9, Ne12, Ne15 and Vx9, Vx12, Vx15), as well as the main wind direction and the presence or absence of rain. We want to best explain the ozone concentration using the variables available in the dataset.
Analyze the scatter plot and the linear correlation between maxO3 and each of the available quantitative variables (i.e., T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12 and Vx15). Is it reasonable to assume that there is a linear relationship between maxO3 and these variables?
Fit the linear regression model explaining maxO3 as a function of all the previous quantitative variables. Test the significance of each of the explanatory variables in this model. Is the result in agreement with the observations from the previous question? Also test the global significance of the model. Comment.
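A minimal sketch for these first two questions (assuming “ozone.txt” is read correctly by read.table with a header):

```r
ozone <- read.table("ozone.txt", header = TRUE)
vars  <- c("T9", "T12", "T15", "Ne9", "Ne12", "Ne15", "Vx9", "Vx12", "Vx15")

# Scatter plots and linear correlations with maxO3
pairs(ozone[, c("maxO3", vars)])
cor(ozone$maxO3, ozone[, vars])

# Full model with all the quantitative variables
full <- lm(maxO3 ~ T9 + T12 + T15 + Ne9 + Ne12 + Ne15 + Vx9 + Vx12 + Vx15,
           data = ozone)
summary(full)   # individual t-tests and global Fisher test
```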
Calculate the VIF (Variance Inflation Factor) for each of the explanatory variables of the previous model. How do these values explain the results of the Student’s t-tests performed above?
We decide to remove variables from the previous model. Which variables does it seem natural to remove in view of the previous question? Fit the new proposed model and repeat the analyses performed in the two previous questions.
Implement an automatic selection of the best possible sub-model of the “large” model fitted in question 2, according to the BIC criterion. You can use the regsubsets function from the leaps library (then plot.regsubsets) or step. Compare the retained model with the model chosen in the previous question.
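A sketch of the two suggested approaches; with step, the penalty k = log(n) corresponds to BIC:

```r
# Exhaustive search among the sub-models, displayed by BIC
library(leaps)
sel <- regsubsets(maxO3 ~ T9 + T12 + T15 + Ne9 + Ne12 + Ne15 +
                    Vx9 + Vx12 + Vx15, data = ozone)
plot(sel, scale = "bic")   # the best model is read off the top row

# Stepwise alternative
step(full, k = log(nrow(ozone)))
```

For the next question, plot(sel, scale = "adjr2") or scale = "Cp" displays the same search under other criteria, and step with k = 2 corresponds to AIC.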
Apply the previous automatic selection based on criteria other than BIC. Are the retained models the same? If not, which seems preferable?
Analyze the residuals of the model selected in the previous question through graphical representations and by performing tests of homoscedasticity and non-correlation of residuals. Do all the assumptions of a linear model seem to be verified?
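A sketch of the residual tests, using the lmtest library; here chosen is a placeholder for the model retained in the previous question:

```r
library(lmtest)
bptest(chosen)   # Breusch-Pagan test of homoscedasticity
dwtest(chosen)   # Durbin-Watson test of autocorrelation of the residuals

# Residuals in time order, to visualize a possible autocorrelation
plot(resid(chosen), type = "b")
```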
In order to address the autocorrelation of the residuals, we propose to add the previous day's ozone maximum to the model. Create this variable, which we will call maxO3v, and add it to the dataset. Do we observe a linear relationship between maxO3 and maxO3v?
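A sketch of the construction, assuming the rows of the dataset are consecutive days (the first day, which has no predecessor, is dropped):

```r
n <- nrow(ozone)
ozone$maxO3v <- c(NA, ozone$maxO3[-n])   # previous day's maximum
ozone <- na.omit(ozone)
plot(maxO3 ~ maxO3v, data = ozone)       # is the relationship linear?
```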
Fit the regression model containing maxO3v as an additional explanatory variable. Analyze the fitting results: are the assumptions of a linear model verified?
Compare this last model to the model without maxO3v using a Fisher test and by comparing the different selection criteria (AIC, BIC, Mallows’ Cp, adjusted \(R^2\)).
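A sketch of the comparison, using the variable list of the model retained later in Exercise 10 (T12, Ne9, Vx9); replace it by the list actually selected if it differs:

```r
m0 <- lm(maxO3 ~ T12 + Ne9 + Vx9, data = ozone)            # without maxO3v
m1 <- lm(maxO3 ~ T12 + Ne9 + Vx9 + maxO3v, data = ozone)   # with maxO3v
anova(m0, m1)                                              # Fisher test

AIC(m0); AIC(m1)
BIC(m0); BIC(m1)
summary(m0)$adj.r.squared; summary(m1)$adj.r.squared
# Mallows' Cp can be computed from its definition recalled in Exercise 8
```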
Exercise 8. Comparison of Model Selection Criteria
Consider a linear regression model aimed at explaining \(Y\) as a function of variables \(X^{(1)}, \ldots, X^{(p)}\). We want to choose between the model with \(X^{(p)}\) and the model without \(X^{(p)}\) (the other variables being included in both cases), based on a sample of \(n\) individuals.
We denote \(F\) the statistic: \[F = (n - p) \frac{\text{SSR}_c - \text{SSR}}{\text{SSR}}\] where SSR denotes the sum of squared residuals in the model with \(X^{(p)}\), and \(\text{SSR}_c\) denotes the sum of squared residuals in the model without \(X^{(p)}\).
By applying a Fisher test of nested models, according to which decision rule, based on \(F\), will we choose to include the variable \(X^{(p)}\) in the model?
We recall that the adjusted \(R^2\) in a model with \(k\) variables and \(n\) individuals is defined by \[R^2_a = 1 - \frac{n - 1}{n - k} \frac{\text{SSR}_k}{\text{SST}}\] where \(\text{SSR}_k\) denotes the sum of squared residuals in the model, and SST the total sum of squares.
Show that we will decide to include \(X^{(p)}\) according to the adjusted \(R^2\) criterion if \(F > 1\).
We recall that the Mallows’ Cp in a model with \(k\) variables and \(n\) individuals is defined by \[C_p = \frac{\text{SSR}_k}{\hat{\sigma}^2} - n + 2k\] where \(\text{SSR}_k\) denotes the sum of squared residuals in the model, and \(\hat{\sigma}^2\) is an estimator of \(\sigma^2\) based on the largest possible model. We will take here \(\hat{\sigma}^2 = \text{SSR}/(n - p)\), where SSR denotes the sum of squared residuals in the model with \(X^{(p)}\).
Show that we will decide to include \(X^{(p)}\) according to the Mallows’ Cp criterion if \(F > 2\).
We recall that the AIC criterion in a model with \(k\) variables, with \(n\) individuals, with Gaussian residuals, is defined by \[\text{AIC} = n(1 + \log(2\pi)) + n \log \frac{\text{SSR}_k}{n} + 2(k + 1)\] where \(\text{SSR}_k\) denotes the sum of squared residuals in the model.
Show that we will decide to include \(X^{(p)}\) according to the AIC criterion if \(F > (n - p)(e^{2/n} - 1)\).
We recall that the BIC criterion (also sometimes called SBC) in a model with \(k\) variables, with \(n\) individuals, with Gaussian residuals, is defined by \[\text{BIC} = n(1 + \log(2\pi)) + n \log \frac{\text{SSR}_k}{n} + \log(n)(k + 1)\] where \(\text{SSR}_k\) denotes the sum of squared residuals in the model.
Show that we will decide to include \(X^{(p)}\) according to the BIC criterion if \(F > (n - p)(e^{\log(n)/n} - 1)\).
Admitting that the 95% quantiles of a Fisher distribution with \((1, \nu)\) degrees of freedom lie in the interval \([3.8, 5]\) as soon as \(\nu > 10\), rank the previous criteria from the most conservative (i.e., tending to reject the introduction of \(X^{(p)}\) most easily) to the least conservative (i.e., tending to accept it most easily). You can use a Taylor expansion for the study of the AIC and BIC criteria, assuming that \(n\) is sufficiently large.
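A quick numerical look at the five inclusion thresholds can guide the ranking (the choice p = 5 is arbitrary):

```r
n <- c(30, 100, 1000)
p <- 5
rbind(Fisher_5pct = qf(0.95, 1, n - p),
      adjusted_R2 = rep(1, length(n)),
      Mallows_Cp  = rep(2, length(n)),
      AIC         = (n - p) * (exp(2 / n) - 1),
      BIC         = (n - p) * (exp(log(n) / n) - 1))
```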
Exercise 9. Over-fitting Probability of Selection Criteria
We remain in the framework of the previous exercise, but we additionally assume that the variable \(X^{(p)}\) is not significant in the model (i.e., its coefficient in the regression is zero) and that the residuals are i.i.d. Gaussian. We take for granted the results stated in the previous exercise.
What distribution does the statistic \(F\) follow? Show that when \(n \to \infty\), this distribution converges to the \(\chi^2(1)\) distribution.
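The convergence can be checked numerically on the 95% quantiles:

```r
qf(0.95, df1 = 1, df2 = c(10, 100, 1000))   # approx. 4.96, 3.94, 3.85
qchisq(0.95, df = 1)                        # approx. 3.84
```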
When implementing the Fisher test of nested models at level \(\alpha \in [0, 1]\), what is the probability of (wrongly) deciding to include the variable \(X^{(p)}\) in the model?
What does the previous probability tend to if we base the decision on the adjusted \(R^2\)?
Same question if the decision is based on Mallows’ Cp.
Same question if the decision is based on the AIC criterion.
Same question if the decision is based on the BIC criterion.
Which criterion is preferable to choose if we want to minimize the risk of including an extra variable in the model?
Supplement: In the opposite situation where \(X^{(p)}\) is significant in the model and it is therefore preferable to include it, we can show (but it is more difficult) that by relying on any of the above criteria, the probability of (wrongly) deciding not to include \(X^{(p)}\) tends to 0 when \(n \to \infty\).
Exercise 10. ANCOVA: Return to Ozone Modeling
We again consider the “ozone.txt” data studied in Exercise 7. We want to take advantage of the qualitative variables present in the dataset (i.e., the presence or absence of rain, and the main wind direction) to possibly improve the model constructed in Exercise 7.
We take the model selected in Exercise 7, namely “maxO3” as a function of “T12”, “Ne9”, “Vx9” and “maxO3v” where “maxO3v” represents the maximum ozone concentration from the previous day (create this variable if needed). Fit this model on the data.
Graphically represent “maxO3” as a function of the presence of rain. Does a relationship seem present? Confirm it by a statistical test.
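A possible sketch, assuming the rain variable is a factor named pluie in the dataset (adapt to the actual column name):

```r
boxplot(maxO3 ~ pluie, data = ozone)   # distribution of maxO3 with/without rain
t.test(maxO3 ~ pluie, data = ozone)    # two-sample test of equal means
```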
Add the “rain” variable to the model from the first question in the most general way possible (i.e., including an interaction with each variable in addition to an effect on the constant). Test the significance of these additions by performing a Fisher test of nested models between this model and the initial model.
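A sketch of the nested-models test, with the same assumption on the name pluie:

```r
mod1 <- lm(maxO3 ~ T12 + Ne9 + Vx9 + maxO3v, data = ozone)            # initial model
mod2 <- lm(maxO3 ~ (T12 + Ne9 + Vx9 + maxO3v) * pluie, data = ozone)  # + rain effect and interactions
anova(mod1, mod2)   # Fisher test of the additional constraints
```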
Similarly test the simpler model in which only an additive effect of the “rain” variable is included, without its interactions with the other variables. Does the result contradict the graphical analysis from question 2? How can the result be explained?
Similarly: graphically represent “maxO3” as a function of wind direction and study the relevance of including a wind effect in the initial model.
Exercise 11. Generalized Least Squares
Consider a multiple linear regression model \[Y = X\beta + \varepsilon\] where \(\beta \in \mathbb{R}^p\), \(X\) is a matrix of size \(n \times p\) and \(\varepsilon\) is a centered random vector of size \(n\). We consider here the situation where the variables \(\varepsilon_i\) are no longer homoscedastic and uncorrelated, but more generally \(\mathbb{V}(\varepsilon) = \Sigma\) where \(\Sigma\) is an invertible matrix. We assume in this exercise that \(\Sigma\) is known (in practice it would have to be estimated).
Specify the matrix \(\Sigma\) when the variables \(\varepsilon_i\) are uncorrelated but heteroscedastic with variance \(\sigma_i^2\) (\(i = 1, \ldots, n\)).
Determine the expectation and the variance of the ordinary least squares estimator \(\hat{\beta}\) (in the general case of any matrix \(\Sigma\)).
For \(S \in \mathbb{R}^n\) and \(T \in \mathbb{R}^n\), we define the scalar product between \(S\) and \(T\) associated with the matrix \(\Sigma^{-1}\) by \(S^T \Sigma^{-1} T\), and thus the norm of \(T\) associated with \(\Sigma^{-1}\) is \(\|T\|_\Sigma^2 = T^T \Sigma^{-1} T\). Show that the explicit form of the generalized least squares estimator \(\hat{\beta}_G\) defined as the minimizer of \(\|Y - X\beta\|_\Sigma\) is \[\hat{\beta}_G = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} Y\]
Deduce its expectation and its variance.
Show that the covariance matrix between \(\hat{\beta}\) and \(\hat{\beta}_G\) is equal to the variance-covariance matrix of \(\hat{\beta}_G\). Deduce that \(\hat{\beta}_G\) is better than \(\hat{\beta}\) in terms of quadratic risk.
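A minimal simulated illustration of the two estimators under heteroscedasticity (all names, the sample size and the variance structure are our own choices):

```r
set.seed(1)
n <- 200
x <- runif(n)
X <- cbind(1, x)                  # design matrix with a constant
sig2 <- exp(2 * x)                # known variances: Sigma = diag(sig2)
y <- X %*% c(1, 2) + rnorm(n, sd = sqrt(sig2))

# Ordinary least squares
beta_ols <- solve(t(X) %*% X, t(X) %*% y)

# Generalized least squares: (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} Y
W <- diag(1 / sig2)               # Sigma^{-1}
beta_gls <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
cbind(OLS = as.vector(beta_ols), GLS = as.vector(beta_gls))

# Equivalent built-in route: weighted least squares via lm
coef(lm(y ~ x, weights = 1 / sig2))
```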