**PARTIAL CORRECTION**

Load the dataset `liver.toxicity`, available in the `mixOmics` package:

```
# install.packages('mixOmics')
data("liver.toxicity", package = "mixOmics")
help(liver.toxicity, package = "mixOmics")
x <- liver.toxicity$gene
y <- liver.toxicity$clinic$ALB.g.dL.
dim(x)
x[1:6, 1:6]
str(y)
```

Load the `randomForest` package and consult the help page of the `randomForest()` function.

Build, with the `randomForest()` function, a Bagging predictor, named `bag`, which involves maximal trees. Keep all default parameters for now.
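The code of the call is not shown in this correction, but it can be recovered from the `Call:` line of the printed output below; with `mtry = ncol(x)`, every variable is tried at each split, so the forest reduces to Bagging of maximal trees:

```
library(randomForest)

# Bagging = random forest with mtry equal to the total number of variables
bag <- randomForest(x = x, y = y, mtry = ncol(x))
bag  # printing the object displays the OOB error (mean of squared residuals)
```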

Determine its OOB error using the output print.

```
## 
## Call:
##  randomForest(x = x, y = y, mtry = ncol(x)) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3116
## 
##           Mean of squared residuals: 0.04300151
##                     % Var explained: 35.66
```

Build now a random forests predictor, named `rf`, which randomly selects \(p/3\) (the default value) variables at each node before optimizing the split. Compare its OOB error with the previous one.
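A sketch of the corresponding call: for regression, `randomForest()` defaults to `mtry = floor(p/3)`, so no `mtry` argument is needed, and printing the object shows the OOB error to compare with Bagging:

```
rf <- randomForest(x = x, y = y)
rf  # the printed "Mean of squared residuals" is the OOB error
```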

Apply the `plot()` function to the object `rf`.

```
## 
## Call:
##  randomForest(x = x, y = y) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1038
## 
##           Mean of squared residuals: 0.04329369
##                     % Var explained: 35.22
```
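For a `randomForest` regression object, `plot()` draws the OOB mean squared error as a function of the number of trees, which is what the next question builds on:

```
plot(rf)  # OOB mean squared error versus number of trees
```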

Tune the number of trees in the forest, while keeping the number of variables selected at each node fixed at its default value. Use the previous plot to guide your choice of tested values.

```
# The previous plot informs us that a number of trees smaller than 100 is
# too small. Again from the plot, we see that a number of trees between 200
# and 500 is reasonable. We can try to increase the number of trees and see
# what happens (take care of the computational time for this example):
rf1000 <- randomForest(x, y, ntree = 1000)
plot(rf1000)
```

Tune now the number of variables selected at each node, while keeping the number of trees at the value found in question 3.

```
# The default value is 1038 in our case. We only test a few values of mtry
# here, because of the large computational time.
# nbtreeopt is the number of trees chosen in the previous question.
p <- ncol(x)
nbvars <- floor(c(sqrt(p), p/10, p/5, p/3, p/2, 2 * p/3))
oobs_mtry <- rep(NA, length(nbvars))
for (nbvInd in seq_along(nbvars)) {
    rf <- randomForest(x, y, ntree = nbtreeopt, mtry = nbvars[nbvInd])
    oobs_mtry[nbvInd] <- rf$mse[nbtreeopt]
}
cbind(nbvars, oobs_mtry)
```

```
##      nbvars  oobs_mtry
## [1,]     55 0.04382080
## [2,]    311 0.04115958
## [3,]    623 0.04168630
## [4,]   1038 0.04107949
## [5,]   1558 0.04405020
## [6,]   2077 0.04120316
```

```
## [1] 1038
```

```
## [1] 0.04107949
```
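The two outputs above are, presumably, the `mtry` value minimizing the OOB error in the table and the corresponding error; the code producing them is not shown, but a plausible version is:

```
mtryopt <- nbvars[which.min(oobs_mtry)]
mtryopt            # selected mtry value
min(oobs_mtry)     # its OOB error
```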

Build a random forest predictor with the following parameter values:

`(replace = FALSE, sampsize = nrow(x), mtry = ncol(x), ntree = 10, maxnodes = 5)`

What are the characteristics of this RF?

Look carefully at the `forest` component of the resulting object (which is a list) and figure out what its content means.

```
# Bootstrap samples are actually not bootstrap samples: draws are made
# without replacement, and the number of observations drawn is fixed to the
# total number of observations in the dataset. So, in this case, all trees
# are built on the full original dataset.
# Furthermore, all variables are considered at each node because mtry = p.
# Finally, the forest contains 10 trees and the trees have 5 leaves maximum.
# If we look at the details of the resulting object, we see that the trees
# are not all the same even though they should be! This comes from the fact
# that there are some ties during the splitting process (several splits
# lead to the same decrease of the heterogeneity of nodes). In this case,
# the final choice of the split is made uniformly at random among all
# tied splits.
```
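A sketch of the corresponding call (the name `rf_degen` is hypothetical); `getTree()` from the `randomForest` package can then be used to inspect individual trees:

```
rf_degen <- randomForest(x, y, replace = FALSE, sampsize = nrow(x),
                         mtry = ncol(x), ntree = 10, maxnodes = 5)
str(rf_degen$forest, max.level = 1)  # the forest component is a list
getTree(rf_degen, k = 1)             # structure of the first tree
```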

The goal in this section is to estimate the prediction error of a RF and of a regression tree on the Liver toxicity data using 10-fold cross-validation.

Randomly choose which observation belongs to each *fold* of the 10-fold cross-validation.

```
##  [1]  5  1  1  8  5  9  2  3  6  8  5  7 10  6  7  8  3  1 10  2  7  4  6
## [24]  6  3  3  4  9 10  4  8  9  1 10  7  2  4  6  2  5 10  9  5  4  7  4
## [47]  2  6  2  1  2  3  4  8  9  1 10  3  3  5  9  7  8  1
```
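The fold assignment vector (named `cvInd` in the final code chunk) is not shown; a standard way to draw it is the following, although the exact values depend on the random seed:

```
n <- nrow(x)
cvInd <- sample(rep(1:10, length.out = n))  # balanced random fold labels
cvInd
```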

Build a RF and a CART tree on all Liver toxicity data except those belonging to the first fold. Predict the observations belonging to the first fold with both predictors (use the `predict()` method for that).

```
xtest1 <- x[indi_test1, ]
ytest1 <- y[indi_test1]
predrf1 <- predict(rf1, xtest1)
predtree1 <- predict(tree1, xtest1)
cbind(predrf1, predtree1, ytest1)
```

```
##        predrf1 predtree1 ytest1
## ID203 4.882577  4.812500    5.0
## ID204 4.966390  4.966667    5.0
## ID303 4.926883  4.812500    5.0
## ID402 5.117233  5.175000    5.4
## ID503 5.138120  5.175000    5.1
## ID512 4.798657  4.571429    4.7
## ID524 4.781863  4.571429    4.6
```
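The construction of `indi_test1`, `rf1` and `tree1` is elided; following the pattern of the full cross-validation loop at the end of the section, it presumably reads:

```
indi_test1 <- which(cvInd == 1)
rf1 <- randomForest(x[-indi_test1, ], y[-indi_test1])
df1 <- data.frame(x[-indi_test1, ], albu = y[-indi_test1])
tree1 <- rpart::rpart(albu ~ ., df1)
```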

Compute the 10-fold cross-validation error estimate for the RF and the CART tree. Comment.

```
pred_RF_def <- rep(NA, n)
pred_tree_def <- rep(NA, n)
for (i in 1:10) {
    indi_test <- which(cvInd == i)
    xtrain <- x[-indi_test, ]
    ytrain <- y[-indi_test]
    xtest <- x[indi_test, ]
    rf <- randomForest(xtrain, ytrain)
    df <- data.frame(xtrain, albu = ytrain)
    arbre <- rpart::rpart(albu ~ ., df)
    pred_RF_def[indi_test] <- predict(rf, xtest)
    pred_tree_def[indi_test] <- predict(arbre, xtest)
}
errCV_RF_def <- mean((pred_RF_def - y)^2)
errCV_tree_def <- mean((pred_tree_def - y)^2)
cbind(errCV_RF_def, errCV_tree_def)
```

```
##      errCV_RF_def errCV_tree_def
## [1,]   0.04238831      0.1012918
```