**PARTIAL CORRECTION**

**Instructions: duration = 2 hours. Send a single file (YourName[.pdf | .Rmd | .R]) to robin.genuer@u-bordeaux.fr containing your answers, your code and all outputs of interest.**

- Load the `Ozone` data from the `mlbench` package and **remove all missing values**.

  *[you can apply the `na.omit()` function to the `Ozone` dataframe to do this]*

We recall that for this dataset we want to predict the variable named `V4` using all other variables.

We denote by \(n\) the number of observations of the dataset and by \(p\) the number of input variables.

```
data(Ozone, package = "mlbench")
ozoneComp <- na.omit(Ozone)
# the object ozoneComp, for 'Ozone complete', only contains the rows from the
# Ozone data which do not have any missing value.
```

```
n <- nrow(ozoneComp)      # number of observations
p <- ncol(ozoneComp) - 1  # number of input variables (the response V4 excluded)
```
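
For reference, a quick display of these quantities (on the complete `Ozone` data this should give `n = 203` and `p = 12`, consistent with the sub-sample sizes and `mtry` grids used below):

```
c(n = n, p = p)
```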

- Build a random forest predictor, named `rfSubSampHalf`, which builds individual trees on sub-samples randomly drawn from the learning set without replacement and containing \(n/2\) observations (and every other parameter left to its default value). Output its OOB error.

```
set.seed(519473)
library(randomForest)
# each tree is grown on a sub-sample of n/2 observations drawn without replacement
rfSubSampHalf <- randomForest(V4 ~ ., ozoneComp, replace = FALSE, sampsize = 0.5 * n)
rfSubSampHalf
```

```
##
## Call:
## randomForest(formula = V4 ~ ., data = ozoneComp, replace = FALSE, sampsize = 0.5 * n)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 20.90126
## % Var explained: 68.69
```

- Build now a RF predictor, named `rfSubSampAll`, which builds individual trees on sub-samples randomly drawn from the learning set without replacement and containing \(n\) observations (and every other parameter left to its default value). Comment on its OOB error.

```
# sub-samples of size n without replacement: every tree uses the whole learning set
rfSubSampAll <- randomForest(V4 ~ ., ozoneComp, replace = FALSE, sampsize = n)
rfSubSampAll
```

```
##
## Call:
## randomForest(formula = V4 ~ ., data = ozoneComp, replace = FALSE, sampsize = n)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: NaN
## % Var explained: NaN
```

```
# It is not possible to compute the OOB error of this RF predictor, because
# no observation is ever out-of-bag: each tree uses the whole learning set.
```
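
This can be checked directly on the fitted object: `randomForest` stores, for each observation, the number of trees for which it was out-of-bag (component `oob.times`), while `predicted` holds the OOB predictions. A minimal sketch:

```
# with sampsize = n and replace = FALSE, every tree sees every observation,
# so oob.times should be all zeros and the OOB predictions all NaN
table(rfSubSampAll$oob.times)
head(rfSubSampAll$predicted)
```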

- More generally, we consider RFs which build individual trees on sub-samples randomly drawn without replacement from the learning set.
**Using OOB error**, tune the size of the sub-sample in order to get the predictor with the best predictive performance.

*[Do not try more than 10 values, in order to limit the computational time]*

```
SampSizes <- floor(n * seq(from = 0.1, to = 0.9, by = 0.1))
oob_SampSizes <- rep(NA, length(SampSizes))
for (size in seq_along(SampSizes)) {
  rfSubSamp <- randomForest(V4 ~ ., ozoneComp, replace = FALSE, sampsize = SampSizes[size])
  # the OOB error of the whole forest is the MSE after the last tree
  oob_SampSizes[size] <- rfSubSamp$mse[rfSubSamp$ntree]
}
cbind(SampSizes, oob_SampSizes)
```

```
## SampSizes oob_SampSizes
## [1,] 20 27.02194
## [2,] 40 23.48591
## [3,] 60 22.66560
## [4,] 81 22.29204
## [5,] 101 21.59533
## [6,] 121 21.08400
## [7,] 142 21.04201
## [8,] 162 20.70542
## [9,] 182 21.16873
```

```
plot(SampSizes, oob_SampSizes, type = "b")
```

```
SampSizeOpt <- SampSizes[which.min(oob_SampSizes)]
SampSizeOpt
```

```
## [1] 162
```
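
Note that the OOB error is itself random, so the selected size may vary from run to run. A possible refinement, shown as a sketch below (the number of repetitions, `nrep = 5`, is an arbitrary choice), is to average the OOB error over several forests per size:

```
nrep <- 5
oob_mean <- sapply(SampSizes, function(s) {
  # average the OOB error of nrep independent forests for sub-sample size s
  mean(replicate(nrep, {
    rf <- randomForest(V4 ~ ., ozoneComp, replace = FALSE, sampsize = s)
    rf$mse[rf$ntree]
  }))
})
cbind(SampSizes, oob_mean)
```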

- Build the RF predictor, named `rfSubSampOpt`, which uses the optimal value for the sub-sample size found in the previous question. Compare the predictive performance of `rfSubSampOpt` with a RF predictor built with all default parameter values. Comment.

```
rfSubSampOpt <- randomForest(V4 ~ ., ozoneComp, replace = FALSE, sampsize = SampSizeOpt)
rfSubSampOpt
```

```
##
## Call:
## randomForest(formula = V4 ~ ., data = ozoneComp, replace = FALSE, sampsize = SampSizeOpt)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 21.20545
## % Var explained: 68.23
```

```
rfDef <- randomForest(V4 ~ ., ozoneComp)
rfDef
```

```
##
## Call:
## randomForest(formula = V4 ~ ., data = ozoneComp)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 21.4754
## % Var explained: 67.83
```

```
# The rfSubSampOpt predictor has a slightly lower estimated prediction
# error (with the OOB error estimate). But the gain seems low compared to
# the computational price paid to optimize the sub-sample size. In
# addition, we could also optimize the size of the bootstrap sample drawn
# with replacement in RF, to see if the performance can be increased.
```
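
As an illustration of this last remark, here is a sketch of tuning the size of a bootstrap sample drawn with replacement (reusing the same grid `SampSizes`, which is an arbitrary choice):

```
# same tuning loop as before, but with sampling WITH replacement
oob_BootSizes <- sapply(SampSizes, function(s) {
  rf <- randomForest(V4 ~ ., ozoneComp, replace = TRUE, sampsize = s)
  rf$mse[rf$ntree]
})
cbind(SampSizes, oob_BootSizes)
```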

- Build a RF predictor, named `rfNoSampMtry1`, which does not perform any sampling (neither a bootstrap sample nor sub-sampling) **and** which randomly chooses only one variable at each node of each tree before optimizing the split of this node (and every other parameter left to its default value).

```
rfNoSampMtry1 <- randomForest(V4 ~ ., ozoneComp, mtry = 1, replace = FALSE, sampsize = n)
rfNoSampMtry1
```

```
##
## Call:
## randomForest(formula = V4 ~ ., data = ozoneComp, mtry = 1, replace = FALSE, sampsize = n)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: NaN
## % Var explained: NaN
```

```
# Again, the OOB error is not available because no observation is OOB.
```

- Use a **3-fold** cross-validation procedure to determine the performance of `rfNoSampMtry1`.

```
x <- ozoneComp[-4]
y <- ozoneComp$V4
nfold <- 3
# random assignment of each observation to one of the 3 folds
cvInd <- sample(rep(1:nfold, ceiling(n/nfold))[1:n])
pred_rfNoSampMtry1 <- rep(NA, n)
for (i in seq_len(nfold)) {
  indTest <- which(cvInd == i)
  xtrain <- x[-indTest, ]
  ytrain <- y[-indTest]
  xtest <- x[indTest, ]
  # RF with no sampling and mtry = 1, trained on the 2 remaining folds
  rf <- randomForest(xtrain, ytrain, mtry = 1, replace = FALSE, sampsize = nrow(xtrain))
  pred_rfNoSampMtry1[indTest] <- predict(rf, xtest)
}
errCV_rfNoSampMtry1 <- mean((pred_rfNoSampMtry1 - y)^2)
errCV_rfNoSampMtry1
```

```
## [1] 20.93826
```

- Tune the number of variables randomly chosen at each node of each tree, for RF with no sampling.

```
cvInd <- sample(rep(1:nfold, ceiling(n/nfold)), n)
nbvars <- 1:p
pred_rfNoSampMtrys <- matrix(NA, nrow = n, ncol = p)
for (i in seq_len(nfold)) {
  indTest <- which(cvInd == i)
  xtrain <- x[-indTest, ]
  ytrain <- y[-indTest]
  xtest <- x[indTest, ]
  # one RF with no sampling per candidate mtry value, sharing the same folds
  for (indnbv in seq_along(nbvars)) {
    rf <- randomForest(xtrain, ytrain, mtry = nbvars[indnbv], replace = FALSE,
                       sampsize = nrow(xtrain))
    pred_rfNoSampMtrys[indTest, indnbv] <- predict(rf, xtest)
  }
}
errCV_rfNoSampMtrys <- apply(pred_rfNoSampMtrys, 2, function(Col) {
  mean((Col - y)^2)
})
cbind(nbvars, errCV_rfNoSampMtrys)
```

```
## nbvars errCV_rfNoSampMtrys
## [1,] 1 21.48266
## [2,] 2 21.25397
## [3,] 3 22.14988
## [4,] 4 23.21163
## [5,] 5 24.43690
## [6,] 6 25.81228
## [7,] 7 26.85276
## [8,] 8 28.49716
## [9,] 9 30.19445
## [10,] 10 33.05720
## [11,] 11 36.49903
## [12,] 12 41.99242
```

```
rfNoSampMtryOpt <- nbvars[which.min(errCV_rfNoSampMtrys)]
rfNoSampMtryOpt
```

```
## [1] 2
```

```
errCV_rfNoSampMtrys[which.min(errCV_rfNoSampMtrys)]
```

```
## [1] 21.25397
```
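
As for the sub-sample size above, the CV error can be plotted against the candidate `mtry` values:

```
plot(nbvars, errCV_rfNoSampMtrys, type = "b")
```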

- **Using OOB error**, tune the number of variables randomly chosen at each node of each tree, for RF with default values for all other parameters.

```
oob_rfDefMtrys <- rep(NA, length(nbvars))
for (nbv in seq_along(nbvars)) {
  # na.action = na.omit on the raw Ozone data is equivalent to using ozoneComp
  rf <- randomForest(V4 ~ ., Ozone, na.action = na.omit, mtry = nbvars[nbv])
  oob_rfDefMtrys[nbv] <- rf$mse[rf$ntree]
}
cbind(nbvars, oob_rfDefMtrys)
```

```
## nbvars oob_rfDefMtrys
## [1,] 1 19.95091
## [2,] 2 19.35733
## [3,] 3 20.71065
## [4,] 4 21.56197
## [5,] 5 22.34600
## [6,] 6 22.67703
## [7,] 7 24.04819
## [8,] 8 24.34117
## [9,] 9 24.66397
## [10,] 10 25.32097
## [11,] 11 26.11624
## [12,] 12 25.45021
```

```
rfDefMtryOpt <- nbvars[which.min(oob_rfDefMtrys)]
rfDefMtryOpt
```

```
## [1] 2
```

```
oob_rfDefMtrys[which.min(oob_rfDefMtrys)]
```

```
## [1] 19.35733
```

- Compare the error reached by the RF with no sampling trained with its associated best `mtry` value, and the error reached by the default RF with its associated best `mtry` value. Comment.

```
# The two errors seem quite close: for this run, the default RF with
# bootstrap actually reaches a slightly smaller error (19.36 versus 21.25
# for the RF with no sampling). Since the gap is small, this still
# suggests that the bootstrap step of RF may not always be mandatory.
```

- Do you think that the two previous errors you have just compared are effectively comparable?

```
# Since OOB error and CV error are both fair estimates of the prediction
# error, the two previous errors are effectively comparable. To compare
# the two RF variants exactly, we could also estimate the error of the
# classical RF (with bootstrap) using the same CV scheme.
```
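
A sketch of this last suggestion, reusing the fold assignment `cvInd` and the tuned `rfDefMtryOpt` from above, so that both variants are evaluated on exactly the same folds:

```
pred_rfDefCV <- rep(NA, n)
for (i in seq_len(nfold)) {
  indTest <- which(cvInd == i)
  # default RF (bootstrap with replacement) with its tuned mtry value
  rf <- randomForest(x[-indTest, ], y[-indTest], mtry = rfDefMtryOpt)
  pred_rfDefCV[indTest] <- predict(rf, x[indTest, ])
}
mean((pred_rfDefCV - y)^2)
```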

- Write a few lines to highlight conclusions which can be drawn from your answers to the previous questions.

```
# Even if such a study should be carried further, it illustrates that the
# most important ingredient of RF is the random choice of variables to
# split on at each node, and that the bootstrap step may not always be
# necessary.
# Concerning the first part of this work, it seems that bootstrap sampling
# can safely be replaced with sampling without replacement, but this
# implies that we have to tune the sub-sample size.
# To conclude, we see that the sampling step, which is the first step of
# RF, is not its crucial point: bootstrap could be replaced by
# sub-sampling without replacement, and the sampling step could even be
# removed. All of these comments only apply to the Ozone dataset, and a
# more in-depth analysis should be carried out to state stronger
# conclusions.
```