1 Liver toxicity data

Load the dataset liver.toxicity available in the mixOmics package:

library("mixOmics")
data("liver.toxicity")
help(liver.toxicity)                   # dataset description
x <- liver.toxicity$gene               # predictors: gene expression levels
y <- liver.toxicity$clinic$ALB.g.dL.   # response: albumin level (g/dL)
dim(x)
x[1:6, 1:6]
str(y)
  1. Build a random forest (RF) with default parameters, asking it to also compute the permutation variable importance (VI) index. Note its out-of-bag (OOB) error.

  2. Plot the VI scores sorted in decreasing order, then plot only the scores of the 100 most important variables.
    [The sort() function can be used to sort the VI scores; specify index.return=TRUE so that it also returns the indices of the sorted values in the original vector]

  3. Using the previous graph, select a subset containing the most important variables (you can apply an elbow rule, as in PCA for example). Keep the indices of the selected variables. Let \(p_{\mathrm{sel}}\) denote the number of selected variables.

  4. Build a RF using only the previously selected variables. Comment on the associated OOB error.

  5. Use a 4-fold cross-validation procedure to estimate the prediction error of a RF built on the \(p_{\mathrm{sel}}\) most important variables only.
    [The selected variables can differ from one fold to another; only their number \(p_{\mathrm{sel}}\) is fixed]
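
One possible way to organise step 5 is sketched below. It is only an illustration, not a reference solution: the helper name cv_rf_selected and its arguments are hypothetical, the randomForest package is assumed to be the RF implementation in use, and the error is measured as a mean squared error. The key point it shows is that the VI ranking is recomputed on each training part, so the held-out fold never influences the selection.

```r
library(randomForest)

# Hypothetical helper: K-fold CV error of a RF restricted to the p_sel
# variables ranked most important on each training part.
cv_rf_selected <- function(x, y, p_sel, n_folds = 4) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Random, roughly balanced assignment of observations to folds
  fold <- sample(rep(seq_len(n_folds), length.out = n))
  pred <- numeric(n)
  for (k in seq_len(n_folds)) {
    test <- which(fold == k)
    # Rank the variables using the training part only
    rf_full <- randomForest(x[-test, , drop = FALSE], y[-test],
                            importance = TRUE)
    vi <- importance(rf_full, type = 1)[, 1]   # permutation importance
    ord <- sort(unname(vi), decreasing = TRUE, index.return = TRUE)$ix
    sel <- ord[seq_len(p_sel)]
    # Refit on the selected variables, then predict the held-out fold
    rf_sel <- randomForest(x[-test, sel, drop = FALSE], y[-test])
    pred[test] <- predict(rf_sel, x[test, sel, drop = FALSE])
  }
  mean((y - pred)^2)   # cross-validated mean squared error
}
```

With the objects defined above, a call would look like cv_rf_selected(x, y, p_sel), where p_sel is the value chosen in step 3. Because the folds and the forests are random, the estimate changes from run to run unless a seed is set with set.seed().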

2 Ozone data

Load the Ozone dataset, available in the mlbench package:

library("mlbench")
data(Ozone)
str(Ozone)
  1. Build a RF with default parameters, asking it to also compute the permutation variable importance index.

  2. Plot the VI scores sorted in decreasing order. Interpret the result.
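
As a starting point, the two steps above could look like the sketch below. It rests on assumptions that the exercise does not state: the response is taken to be V4 (the daily maximum one-hour-average ozone reading in the mlbench documentation), rows with missing values are simply dropped with na.omit(), and the randomForest package is used. Other choices, such as imputing the missing values, are equally possible.

```r
library(randomForest)
library(mlbench)
data(Ozone)

# Assumption: V4 is the response; rows containing NA are discarded
oz <- na.omit(Ozone)

# RF with default parameters, plus permutation importance
rf_oz <- randomForest(V4 ~ ., data = oz, importance = TRUE)

# type = 1 selects the permutation-based importance measure
vi <- importance(rf_oz, type = 1)[, 1]
vi_sorted <- sort(vi, decreasing = TRUE)

# Bar plot of the sorted scores, one bar per predictor
barplot(vi_sorted, las = 2, ylab = "Permutation importance")
```

The names of vi_sorted are the variable codes V1 to V13 (without V4); help(Ozone) gives their meaning, which is needed to interpret the ranking.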