Instructions: duration = 2 hours; send a single file (YourName.pdf, YourName.Rmd or YourName.R) to
robin.genuer@u-bordeaux.fr
containing your answers, your code, and all outputs of interest.


Advice: to keep the computational time under control, each random forest will be run only once. For example, even though you may be tempted to average several random forest runs when tuning a parameter, in order to stabilize the results, I kindly advise you not to do so (at least during this practical work).

  1. Load the Ozone data from the mlbench package and remove all missing values
    [you can apply the na.omit() function to the Ozone data frame to do this].
    Recall that for this dataset the goal is to predict the variable named V4 from all the other variables.
    We denote by \(n\) the number of observations in the dataset and by \(p\) the number of input variables.
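
    For instance, a minimal sketch of this step (the object name ozone is an arbitrary choice):

    ```r
    library(mlbench)
    data(Ozone)                 # Los Angeles ozone pollution data
    ozone <- na.omit(Ozone)     # remove all rows containing missing values
    n <- nrow(ozone)            # number of observations
    p <- ncol(ozone) - 1        # number of input variables (V4 is the response)
    ```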

  2. Build a random forest (RF) predictor, named rfSubSampHalf, whose individual trees are built on sub-samples randomly drawn from the learning set, without replacement and containing \(n/2\) observations (every other parameter being left at its default value). Output its OOB error.
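
    A sketch of one way to do this, assuming the randomForest package, where replace = FALSE requests sampling without replacement and sampsize sets the sub-sample size (the seed is an arbitrary choice):

    ```r
    library(randomForest)
    set.seed(1)   # arbitrary seed, only for reproducibility
    rfSubSampHalf <- randomForest(V4 ~ ., data = ozone,
                                  replace = FALSE, sampsize = floor(n / 2))
    rfSubSampHalf   # the printout reports the OOB error as "Mean of squared residuals"
    ```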

  3. Now build an RF predictor, named rfSubSampAll, whose individual trees are built on sub-samples randomly drawn from the learning set, without replacement and containing \(n\) observations (every other parameter being left at its default value). Comment on its OOB error.
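
    A sketch under the same assumptions:

    ```r
    rfSubSampAll <- randomForest(V4 ~ ., data = ozone,
                                 replace = FALSE, sampsize = n)
    rfSubSampAll   # with sampsize = n and no replacement, every observation is
                   # in-bag for every tree, so no OOB error can be computed
    ```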

  4. More generally, we consider RFs whose individual trees are built on sub-samples randomly drawn from the learning set. Using the OOB error, tune the size of the sub-sample in order to get the predictor with the best predictive performance.
    [Do not try more than 10 values, in order to limit the computational time.]
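
    A possible tuning loop, still assuming randomForest (the grid of 9 candidate sizes is an arbitrary choice; for a regression forest, rf$mse stores the OOB mean squared error after each tree, so its last element is the final OOB error):

    ```r
    sizes <- floor(seq(0.1, 0.9, length.out = 9) * n)   # candidate sub-sample sizes
    oobErrors <- sapply(sizes, function(s) {
      rf <- randomForest(V4 ~ ., data = ozone, replace = FALSE, sampsize = s)
      tail(rf$mse, 1)                                   # final OOB error of this forest
    })
    bestSize <- sizes[which.min(oobErrors)]
    ```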

  5. Build the RF predictor, named rfSubSampOpt, which uses the optimal sub-sample size found in the previous question. Compare the predictive performance of rfSubSampOpt with that of an RF predictor built with all parameters at their default values. Comment.
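
    A sketch of the comparison (bestSize comes from the previous sketch):

    ```r
    rfSubSampOpt <- randomForest(V4 ~ ., data = ozone,
                                 replace = FALSE, sampsize = bestSize)
    rfDefault <- randomForest(V4 ~ ., data = ozone)   # default: bootstrap sampling
    c(optimized = tail(rfSubSampOpt$mse, 1),          # OOB errors of the two forests
      default   = tail(rfDefault$mse, 1))
    ```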

  6. Build an RF predictor, named rfNoSampMtry1, which performs no sampling at all (neither bootstrap sampling nor sub-sampling) and which randomly chooses only one variable at each node of each tree before optimizing the split of that node (every other parameter being left at its default value).
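
    One way to express "no sampling" with randomForest is to draw, without replacement, a sub-sample of size \(n\), i.e. the whole learning set; mtry = 1 then gives the requested choice of a single variable per node:

    ```r
    rfNoSampMtry1 <- randomForest(V4 ~ ., data = ozone,
                                  replace = FALSE, sampsize = n, mtry = 1)
    ```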

  7. Use a 3-fold cross-validation procedure to determine the performance of rfNoSampMtry1.
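
    A sketch of such a procedure (the fold assignment and the seed are arbitrary choices; CV is needed here because, as seen above, no OOB error is available without sampling):

    ```r
    set.seed(1)                                   # arbitrary seed
    folds <- sample(rep(1:3, length.out = n))     # random balanced fold assignment
    cvErrors <- sapply(1:3, function(k) {
      train <- ozone[folds != k, ]
      test  <- ozone[folds == k, ]
      rf <- randomForest(V4 ~ ., data = train,
                         replace = FALSE, sampsize = nrow(train), mtry = 1)
      mean((predict(rf, test) - test$V4)^2)       # test MSE on the held-out fold
    })
    mean(cvErrors)                                # 3-fold CV error of rfNoSampMtry1
    ```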

  8. Tune the number of variables randomly chosen at each node of each tree, for the RF with no sampling.
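
    Since no OOB error is available without sampling, one option is to reuse the 3-fold CV above for each candidate mtry value (a sketch reusing folds, ozone and p from the previous sketches; the grid 1:p covers the whole range of possible mtry values):

    ```r
    cvErrByMtry <- sapply(1:p, function(m) {
      mean(sapply(1:3, function(k) {
        train <- ozone[folds != k, ]
        test  <- ozone[folds == k, ]
        rf <- randomForest(V4 ~ ., data = train,
                           replace = FALSE, sampsize = nrow(train), mtry = m)
        mean((predict(rf, test) - test$V4)^2)
      }))
    })
    bestMtryNoSamp <- which.min(cvErrByMtry)      # best mtry for the no-sampling RF
    ```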

  9. Using the OOB error, tune the number of variables randomly chosen at each node of each tree, for an RF with default values for all other parameters.
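
    A sketch of the OOB-based tuning for the default RF:

    ```r
    oobByMtry <- sapply(1:p, function(m) {
      rf <- randomForest(V4 ~ ., data = ozone, mtry = m)   # default sampling kept
      tail(rf$mse, 1)                                      # final OOB error
    })
    bestMtryDefault <- which.min(oobByMtry)
    ```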

  10. Compare the error reached by the RF with no sampling, trained with its best mtry value, with the error reached by the default RF trained with its best mtry value. Comment.
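
    With the objects defined in the two previous sketches, the comparison could read:

    ```r
    c(noSampling = cvErrByMtry[bestMtryNoSamp],   # 3-fold CV error
      default    = oobByMtry[bestMtryDefault])    # OOB error
    ```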

  11. Do you think that the two errors you have just compared are really comparable?

  12. Write a few lines highlighting the conclusions that can be drawn from your answers to the previous questions.