PARTIAL CORRECTION

Instructions: duration = 2 hours; send a single file (YourName[.pdf | .Rmd | .R]) to
robin.genuer@u-bordeaux.fr
containing your answers, your code and all outputs of interest.


Advice: to control computation time, each random forest will be run only once. For example, even though you may be tempted to average several random forest runs when tuning a parameter, in order to stabilize the results, I kindly advise you not to do so (at least during this practical work).
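For reference only, a minimal sketch of what such averaging could look like (the number of runs, 10, and the name oobErrors are arbitrary choices of ours; the data loading duplicates question 1 below so that the snippet is self-contained):

# Sketch only, NOT to be used during this practical: average the final OOB
# error over several independent forests to stabilize the estimate.
library(randomForest)
data(Ozone, package = "mlbench")
ozoneComp <- na.omit(Ozone)
oobErrors <- replicate(10, {
    rf <- randomForest(V4 ~ ., ozoneComp)
    rf$mse[rf$ntree]  # OOB mean squared error of the full forest
})
mean(oobErrors)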

  1. Load the Ozone data from the mlbench package and remove all missing values.
    [you can apply the na.omit() function to the Ozone dataframe to do this].
    Recall that for this dataset we want to predict the variable named V4 using all other variables.
    We denote by \(n\) the number of observations of the dataset and by \(p\) the number of input variables.
data(Ozone, package = "mlbench")
ozoneComp <- na.omit(Ozone)
# the object ozoneComp, for 'Ozone complete', only contains the rows from the
# Ozone data which do not have any missing value.
n <- nrow(ozoneComp)
p <- ncol(ozoneComp) - 1
  2. Build a random forest predictor, named rfSubSampHalf, which builds individual trees on sub-samples randomly drawn from the learning set without replacement and containing \(n/2\) observations (and every other parameter left to its default value). Output its OOB error.
set.seed(519473)
library(randomForest)
rfSubSampHalf <- randomForest(V4 ~ ., ozoneComp, replace = FALSE, sampsize = 0.5 * n)
rfSubSampHalf
## 
## Call:
##  randomForest(formula = V4 ~ ., data = ozoneComp, replace = FALSE,      sampsize = 0.5 * n) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 20.90126
##                     % Var explained: 68.69
  3. Now build an RF predictor, named rfSubSampAll, which builds individual trees on sub-samples randomly drawn from the learning set without replacement and containing \(n\) observations (and every other parameter left to its default value). Comment on its OOB error.
rfSubSampAll <- randomForest(V4 ~ ., ozoneComp, replace = FALSE, sampsize = n)
rfSubSampAll
## 
## Call:
##  randomForest(formula = V4 ~ ., data = ozoneComp, replace = FALSE,      sampsize = n) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: NaN
##                     % Var explained: NaN
# It is not possible to compute the OOB error of this RF predictor, because
# no observation is OOB: each tree is built on the whole learning set.
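This can be checked directly on the fitted object: the oob.times component of a randomForest object counts how many times each observation was out-of-bag, and here it should be zero for all observations.

# Sanity check: with replace = FALSE and sampsize = n, every observation is
# in-bag for every tree, so all out-of-bag counts are zero.
table(rfSubSampAll$oob.times)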
  4. More generally, we consider RFs which build individual trees on sub-samples randomly drawn without replacement from the learning set. Using OOB error, tune the size of the sub-sample in order to get the predictor with the best predictive performance.
    [Do not try more than 10 values, in order to limit the computational time]
SampSizes <- floor(n * seq(from = 0.1, to = 0.9, by = 0.1))
oob_SampSizes <- rep(NA, length(SampSizes))
for (size in seq_along(SampSizes)) {
    rfSubSamp <- randomForest(V4 ~ ., ozoneComp, replace = FALSE, sampsize = SampSizes[size])
    # OOB error of the whole forest = MSE after the last tree has been added
    oob_SampSizes[size] <- rfSubSamp$mse[rfSubSamp$ntree]
}
cbind(SampSizes, oob_SampSizes)
##       SampSizes oob_SampSizes
##  [1,]        20      27.02194
##  [2,]        40      23.48591
##  [3,]        60      22.66560
##  [4,]        81      22.29204
##  [5,]       101      21.59533
##  [6,]       121      21.08400
##  [7,]       142      21.04201
##  [8,]       162      20.70542
##  [9,]       182      21.16873
plot(SampSizes, oob_SampSizes, type = "b")

SampSizeOpt <- SampSizes[which.min(oob_SampSizes)]
SampSizeOpt
## [1] 162
  5. Build the RF predictor, named rfSubSampOpt, which uses the optimal value for the sub-sample size found in the previous question. Compare the predictive performance of rfSubSampOpt with that of an RF predictor built with all default parameter values. Comment.
rfSubSampOpt <- randomForest(V4 ~ ., ozoneComp, replace = FALSE, sampsize = SampSizeOpt)
rfSubSampOpt
## 
## Call:
##  randomForest(formula = V4 ~ ., data = ozoneComp, replace = FALSE,      sampsize = SampSizeOpt) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 21.20545
##                     % Var explained: 68.23
rfDef <- randomForest(V4 ~ ., ozoneComp)
rfDef
## 
## Call:
##  randomForest(formula = V4 ~ ., data = ozoneComp) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 21.4754
##                     % Var explained: 67.83
# The rfSubSampOpt predictor has a lower estimated prediction error (as
# measured by OOB error). But the gain seems small compared to the
# computational price of optimizing the sub-sample size. In addition, we
# could also optimize the size of the bootstrap sample drawn with
# replacement in RF, to see if the performance can be increased.
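As a sketch of that last suggestion (not required by the question), the tuning loop of question 4 can be reused with replace = TRUE; the grid SampSizes is the one defined above and the name oob_BootSizes is our choice:

# Sketch: tune the size of the bootstrap sample (drawn WITH replacement),
# reusing the grid of candidate sizes from question 4.
oob_BootSizes <- rep(NA, length(SampSizes))
for (size in seq_along(SampSizes)) {
    rfBoot <- randomForest(V4 ~ ., ozoneComp, replace = TRUE, sampsize = SampSizes[size])
    oob_BootSizes[size] <- rfBoot$mse[rfBoot$ntree]
}
cbind(SampSizes, oob_BootSizes)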
  6. Build an RF predictor, named rfNoSampMtry1, which does not perform any sampling (neither bootstrap sampling nor sub-sampling) and which randomly chooses only one variable at each node of each tree before optimizing the split of that node (and every other parameter left to its default value).
rfNoSampMtry1 <- randomForest(V4 ~ ., ozoneComp, mtry = 1, replace = FALSE, sampsize = n)
rfNoSampMtry1
## 
## Call:
##  randomForest(formula = V4 ~ ., data = ozoneComp, mtry = 1, replace = FALSE,      sampsize = n) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: NaN
##                     % Var explained: NaN
# Again, the OOB error is not available because no observation is OOB.
  7. Use a 3-fold cross-validation procedure to determine the performance of rfNoSampMtry1.
x <- ozoneComp[-4]  # all input variables (V4 is the 4th column)
y <- ozoneComp$V4
nfold <- 3
# randomly assign each observation to one of the 3 folds
cvInd <- sample(rep(1:nfold, ceiling(n/nfold))[1:n])
pred_rfNoSampMtry1 <- rep(NA, n)

for (i in seq_len(nfold)) {
    indTest <- which(cvInd == i)
    xtrain <- x[-indTest, ]
    ytrain <- y[-indTest]
    xtest <- x[indTest, ]
    
    # no sampling: each tree is built on the whole training fold
    rf <- randomForest(xtrain, ytrain, mtry = 1, replace = FALSE, sampsize = nrow(xtrain))
    
    pred_rfNoSampMtry1[indTest] <- predict(rf, xtest)
}

errCV_rfNoSampMtry1 <- mean((pred_rfNoSampMtry1 - y)^2)
errCV_rfNoSampMtry1
## [1] 20.93826
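Since the next question repeats the same scheme, it could optionally be wrapped in a helper; a sketch under the no-sampling setting (the helper name cvErrorNoSamp is ours, not part of the required answer):

# Optional refactor (sketch): the same CV scheme for a no-sampling RF, wrapped
# in a helper so it can be reused with other parameter values.
cvErrorNoSamp <- function(x, y, cvInd, ...) {
    pred <- rep(NA, length(y))
    for (i in sort(unique(cvInd))) {
        indTest <- which(cvInd == i)
        xtrain <- x[-indTest, ]
        rf <- randomForest(xtrain, y[-indTest], ..., replace = FALSE, sampsize = nrow(xtrain))
        pred[indTest] <- predict(rf, x[indTest, ])
    }
    mean((pred - y)^2)
}
# e.g. cvErrorNoSamp(x, y, cvInd, mtry = 1) reproduces the computation above,
# up to the randomness of the forests.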
  8. Tune the number of variables randomly chosen at each node of each tree, for RF with no sampling.
cvInd <- sample(rep(1:nfold, ceiling(n/nfold)), n)
nbvars <- 1:p
# one column of CV predictions per candidate value of mtry
pred_rfNoSampMtrys <- matrix(NA, nrow = n, ncol = p)

for (i in seq_len(nfold)) {
    indTest <- which(cvInd == i)
    xtrain <- x[-indTest, ]
    ytrain <- y[-indTest]
    xtest <- x[indTest, ]
    
    for (indnbv in seq_along(nbvars)) {
        rf <- randomForest(xtrain, ytrain, mtry = nbvars[indnbv], replace = FALSE, sampsize = nrow(xtrain))
        pred_rfNoSampMtrys[indTest, indnbv] <- predict(rf, xtest)
    }
}

errCV_rfNoSampMtrys <- apply(pred_rfNoSampMtrys, 2, function(Col) {
    mean((Col - y)^2)
})

cbind(nbvars, errCV_rfNoSampMtrys)
##       nbvars errCV_rfNoSampMtrys
##  [1,]      1            21.48266
##  [2,]      2            21.25397
##  [3,]      3            22.14988
##  [4,]      4            23.21163
##  [5,]      5            24.43690
##  [6,]      6            25.81228
##  [7,]      7            26.85276
##  [8,]      8            28.49716
##  [9,]      9            30.19445
## [10,]     10            33.05720
## [11,]     11            36.49903
## [12,]     12            41.99242
rfNoSampMtryOpt <- nbvars[which.min(errCV_rfNoSampMtrys)]
rfNoSampMtryOpt
## [1] 2
errCV_rfNoSampMtrys[which.min(errCV_rfNoSampMtrys)]
## [1] 21.25397
  9. Using OOB error, tune the number of variables randomly chosen at each node of each tree, for RF with default values for all other parameters.
oob_rfDefMtrys <- rep(NA, length(nbvars))
for (nbv in seq_along(nbvars)) {
    # na.action = na.omit applied to Ozone gives the same data as ozoneComp
    rf <- randomForest(V4 ~ ., Ozone, na.action = na.omit, mtry = nbvars[nbv])
    oob_rfDefMtrys[nbv] <- rf$mse[rf$ntree]
}
cbind(nbvars, oob_rfDefMtrys)
##       nbvars oob_rfDefMtrys
##  [1,]      1       19.95091
##  [2,]      2       19.35733
##  [3,]      3       20.71065
##  [4,]      4       21.56197
##  [5,]      5       22.34600
##  [6,]      6       22.67703
##  [7,]      7       24.04819
##  [8,]      8       24.34117
##  [9,]      9       24.66397
## [10,]     10       25.32097
## [11,]     11       26.11624
## [12,]     12       25.45021
rfDefMtryOpt <- nbvars[which.min(oob_rfDefMtrys)]
rfDefMtryOpt
## [1] 2
oob_rfDefMtrys[which.min(oob_rfDefMtrys)]
## [1] 19.35733
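Note that the randomForest package also provides the tuneRF() helper, which searches mtry by OOB error with a multiplicative step rather than an exhaustive grid; a possible call on the same data (its search path, and hence its result, may differ from the grid above):

# Alternative: let tuneRF() search mtry by OOB error, doubling/halving mtry
# as long as the relative improvement exceeds `improve`.
tuneRF(x, y, ntreeTry = 500, stepFactor = 2, improve = 0.05)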
  10. Compare the error reached by the RF with no sampling, trained with its associated best mtry value, and the error reached by the default RF with its associated best mtry value. Comment.
# The two errors seem quite close and, at least for this run, the RF with no
# sampling reaches a slightly smaller error than the RF with bootstrap.
# This may be surprising, but it suggests that the bootstrap step of RF
# may not always be mandatory.
  11. Do you think that the two previous errors you have just compared are effectively comparable?
# Since the OOB error and the CV error are both fair estimates of the
# prediction error, the two previous errors are effectively comparable. To
# compare the two RF variants exactly, we could also estimate the error of
# the classical RF (with bootstrap) using the same CV scheme.
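A sketch of that last check, reusing the fold assignment cvInd from question 8 so that both variants are evaluated on exactly the same partition (the name pred_rfDefCV is ours):

# Sketch: CV error of the classical RF (bootstrap, mtry = rfDefMtryOpt),
# computed on the same 3-fold partition as the no-sampling RF.
pred_rfDefCV <- rep(NA, n)
for (i in seq_len(nfold)) {
    indTest <- which(cvInd == i)
    rf <- randomForest(x[-indTest, ], y[-indTest], mtry = rfDefMtryOpt)
    pred_rfDefCV[indTest] <- predict(rf, x[indTest, ])
}
mean((pred_rfDefCV - y)^2)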
  12. Write a few lines to highlight conclusions which can be drawn from your answers to the previous questions.
# Even if such a study should be pushed further, it illustrates that the most
# important ingredient of RF is the random choice of the variables to split
# on at each node, and that the bootstrap step may not always be necessary.

# Concerning the first part of this work, it seems that bootstrap sampling
# can safely be replaced by sampling without replacement, but this implies
# that we have to tune the sub-sample size.

# To conclude, we see that the sampling step, which is the first step of RF,
# is not the crucial point of RF. The bootstrap could be replaced by
# sub-sampling without replacement, and the sampling step could even be
# removed. All of these comments only apply to the Ozone dataset, and a
# more in-depth analysis should be carried out to draw stronger conclusions.