1 Ozone data

In this first part, we consider a dataset on ozone pollution in Los Angeles in 1976.

Load the dataset, available in the mlbench package, using the following commands:

# if the package is not yet installed, execute: install.packages('mlbench')
library(mlbench)
# data(Ozone) loads the Ozone dataset into the environment
data(Ozone)
# str() gives the dataset structure
str(Ozone)

Consult the help page, with help(Ozone) or ?Ozone, to get a full description of the dataset.


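In this dataset, we have 366 daily observations of 13 variables. The variable V4 (daily maximum one-hour-average ozone reading) is the response to predict in the questions below; the other 12 variables are the candidate predictors.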
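
The size of the dataset and the amount of missing data can be checked directly in R. The following is a minimal sketch, assuming the mlbench package is installed; the names n_obs, n_var and na_per_var are introduced here for illustration only.

```r
library(mlbench)
data(Ozone)

# number of observations (rows) and variables (columns)
n_obs <- nrow(Ozone)
n_var <- ncol(Ozone)
c(observations = n_obs, variables = n_var)

# number of missing values in each variable
na_per_var <- colSums(is.na(Ozone))
na_per_var

# summary of V4, the response used in the rest of this section
summary(Ozone$V4)
```

Knowing which variables contain missing values matters later: rpart() drops observations whose response is missing, but keeps those with missing predictors.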

  1. Load the rpart package (for recursive partitioning) with library(rpart), then read the beginning of the rpart() help page, help(rpart).

  2. Tree building.

    1. Build a regression tree t with the rpart() function. Note that its first argument must be a formula such as \(\mathtt{y \sim x}\) and its second argument a data frame.

    2. Print the tree by simply executing t, and plot it using plot(t) followed by text(t). Also try to interpret the output of summary(t).

    3. Using the predict() function, compute the predictions of t on the learning set. The predict() function requires the tree object and the dataset to predict (see the predict.rpart() help page). Then compute the empirical error of t; for a regression tree this is usually the mean squared error on the learning set (Hint: observations with a missing value for V4 must be removed).

  3. Determine which tree is built by each of the following commands (see the rpart.control() help page for more information on the tree-building parameters):

    t1 <- rpart(V4 ~ ., data = Ozone, maxdepth = 1)
    t2 <- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0)
  4. Print and plot t1 and t2, and compute their empirical errors on the learning set.

  5. Pruning.

    1. From the maximal tree, print the results for the sequence of nested pruned sub-trees (see the cptable component in the rpart.object help page).

    2. Find the complexity value corresponding to the minimum of the cross-validation error (column xerror).

    3. Prune the maximal tree with the prune() function, using the complexity value found in the previous step.
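
The questions above can be sketched end to end as follows. This is one possible approach, not the only one: it assumes that the empirical error is measured by the mean squared error, it removes the observations with a missing V4 before fitting, and the names oz, mse, errors, cp_table, best_cp, t_pruned and n_leaves are introduced here for illustration only.

```r
library(rpart)
library(mlbench)
data(Ozone)

# keep only the observations for which the response V4 is observed
oz <- Ozone[!is.na(Ozone$V4), ]

# illustrative helper: mean squared error of a tree on a dataset
mse <- function(tree, data) {
  mean((data$V4 - predict(tree, newdata = data))^2)
}

set.seed(1)  # the cross-validation run inside rpart() is random

# default tree, tree of depth 1, and maximal tree
t  <- rpart(V4 ~ ., data = oz)
t1 <- rpart(V4 ~ ., data = oz, maxdepth = 1)
t2 <- rpart(V4 ~ ., data = oz, minsplit = 2, cp = 0)

# empirical errors on the learning set
errors <- sapply(list(t = t, t1 = t1, t2 = t2), mse, data = oz)
errors

# nested pruned sub-trees of the maximal tree:
# one row per sub-tree, with its complexity (CP) and cross-validation error
cp_table <- t2$cptable
printcp(t2)

# complexity value minimising the cross-validation error
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# pruned tree and its number of leaves
t_pruned <- prune(t2, cp = best_cp)
n_leaves <- sum(t_pruned$frame$var == "<leaf>")
n_leaves
```

The maximal tree t2 has a very small empirical error because it is grown until almost every leaf contains a single observation; the cross-validation error in cp_table is the quantity to trust when choosing where to prune.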

2 Liver toxicity data

Load the dataset liver.toxicity available in the mixOmics package:

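# install.packages('mixOmics')
# (on recent R versions, mixOmics is distributed via Bioconductor: BiocManager::install('mixOmics'))
library(mixOmics)
data(liver.toxicity)
x <- liver.toxicity$gene
y <- liver.toxicity$clinic$ALB.g.dL.
x[1:6, 1:6]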

Analyze this dataset using a regression tree, as in Section 1: maximal tree building, pruning…
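
Because rpart() takes a formula and a single data frame, the response y and the gene expression matrix x first have to be combined. The sketch below shows one way to do this and then builds and prunes the maximal tree; it assumes the mixOmics package is installed, and the names liver, t_max, cp_table, best_cp and t_pruned are introduced here for illustration only.

```r
library(rpart)
library(mixOmics)  # assumed to be installed (distributed via Bioconductor)
data(liver.toxicity)

x <- liver.toxicity$gene
y <- liver.toxicity$clinic$ALB.g.dL.

# rpart() expects the response and the predictors in one data frame
liver <- data.frame(y = y, x)

set.seed(1)  # the cross-validation run inside rpart() is random

# maximal tree
t_max <- rpart(y ~ ., data = liver, minsplit = 2, cp = 0)

# complexity value minimising the cross-validation error, then pruning
cp_table <- t_max$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
t_pruned <- prune(t_max, cp = best_cp)
t_pruned
```

With many more genes than samples, the cross-validation error can vary noticeably from one run to the next, so it is worth repeating the fit with different seeds before interpreting the pruned tree.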