PARTIAL CORRECTION

1 Ozone data

In this first part, we consider a dataset on Ozone pollution in Los Angeles in 1976.

Load the dataset, available in the mlbench package, using the following commands:

library("mlbench")
# if the package is not yet installed, execute: install.packages('mlbench')
# data(Ozone) loads the Ozone dataset into the environment
data(Ozone)
# str() gives the dataset structure
str(Ozone)
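
As a complement to str(), the dimensions of the dataset and the number of missing values per variable can be checked directly; the missing values matter later, when errors are computed. This is only a small sketch, and the name na.count is illustrative.

```r
library("mlbench")
data(Ozone)

# number of rows (days) and columns (variables)
dim(Ozone)

# number of missing values in each variable
na.count <- colSums(is.na(Ozone))
na.count
```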

Consult the help page to get a full description of the dataset:

help(Ozone)

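In this dataset, we have 366 daily observations of 13 variables. The response modelled below is V4, the daily maximum one-hour-average ozone reading; several variables, including V4, contain missing values.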

  1. Load the rpart package (for recursive partitioning), then read the beginning of the help page of the rpart() function:

    library("rpart")
    help(rpart)
  2. Tree building.

    1. Build a regression tree t with the rpart() function. Note that its first argument must be a formula of the form \(\mathtt{y \sim x}\) and its second argument the dataset.
    t <- rpart(V4 ~ ., data = Ozone)
    1. Print the tree by simply executing t, and plot it using plot(t) followed by text(t). Also try to interpret the output of summary(t).
    t
    plot(t)
    text(t)

    1. Using the predict() function, predict the response on the learning set. predict() requires the tree object and the dataset to predict (see the predict.rpart() help page). Compute the empirical error of t, here the mean squared error on the learning set (hint: observations with a missing value of V4 must be removed).
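    # predictions for every row of the learning set
    t.pred <- predict(object = t, newdata = Ozone)
    err.emp <- mean((Ozone$V4 - t.pred)^2)
    # err.emp is NA here because V4 contains missing values
    # indices of the rows where V4 is missing
    naV4 <- which(is.na(Ozone$V4))
    # mean squared error over the rows where V4 is observed
    err.emp <- mean((Ozone$V4[-naV4] - t.pred[-naV4])^2)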
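
    The same computation can be wrapped in a small helper that selects the observed rows with a logical mask. This is only a sketch: mse() is a hypothetical name, not a function of rpart. Unlike negative indexing with which(), the mask also behaves correctly when no value is missing, since x[-integer(0)] returns an empty vector.

    ```r
    library("mlbench")
    library("rpart")
    data(Ozone)

    # hypothetical helper: mean squared error over the rows where obs is not NA
    mse <- function(obs, pred) {
      keep <- !is.na(obs)
      mean((obs[keep] - pred[keep])^2)
    }

    t <- rpart(V4 ~ ., data = Ozone)
    err.emp <- mse(Ozone$V4, predict(t, newdata = Ozone))
    err.emp
    ```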
  3. Determine which trees are built when the following commands are executed (see the rpart.control() help page for more information on the tree-building parameters):

    t1 <- rpart(V4 ~ ., data = Ozone, maxdepth = 1)
    t2 <- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0)
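
    In these calls, maxdepth, minsplit and cp are not arguments of rpart() itself; they are passed on to rpart.control(). As an illustrative sketch (the names t1.ctrl and t2.ctrl are made up here), the same settings can be given explicitly through the control argument, and the resulting predictions should coincide:

    ```r
    library("mlbench")
    library("rpart")
    data(Ozone)

    # settings passed through "..." as in the commands above
    t1 <- rpart(V4 ~ ., data = Ozone, maxdepth = 1)
    t2 <- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0)

    # same settings given explicitly through rpart.control()
    t1.ctrl <- rpart(V4 ~ ., data = Ozone,
                     control = rpart.control(maxdepth = 1))
    t2.ctrl <- rpart(V4 ~ ., data = Ozone,
                     control = rpart.control(minsplit = 2, cp = 0))

    # compare the predictions of both versions on the learning set
    all.equal(predict(t1, Ozone), predict(t1.ctrl, Ozone))
    all.equal(predict(t2, Ozone), predict(t2.ctrl, Ozone))
    ```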
  4. Print and plot t1 and t2 and compute their empirical error rates.

    t1
    plot(t1)
    text(t1)

    t1.pred <- predict(t1, Ozone)
    err.emp1 <- mean((Ozone$V4[-naV4] - t1.pred[-naV4])^2)
    t2
    plot(t2)
    text(t2)

    t2.pred <- predict(t2, Ozone)
    err.emp2 <- mean((Ozone$V4[-naV4] - t2.pred[-naV4])^2)
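
    To compare the three trees side by side, their number of leaves and their empirical errors can be gathered in one table. The sketch below rebuilds the trees so that it runs on its own; the names trees, obs and res, and counting leaves through the frame component of the rpart object, are choices made for this illustration.

    ```r
    library("mlbench")
    library("rpart")
    data(Ozone)

    trees <- list(
      t  = rpart(V4 ~ ., data = Ozone),
      t1 = rpart(V4 ~ ., data = Ozone, maxdepth = 1),
      t2 = rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0)
    )

    # rows where the response V4 is observed
    obs <- !is.na(Ozone$V4)

    # one column per tree: number of leaves and empirical mean squared error
    res <- sapply(trees, function(tr) {
      pred <- predict(tr, newdata = Ozone)
      c(leaves  = sum(tr$frame$var == "<leaf>"),
        err.emp = mean((Ozone$V4[obs] - pred[obs])^2))
    })
    res
    ```

    One would expect the empirical error to decrease as the tree grows: t1 has at most one split, whereas t2 is grown as far as the stopping rules allow.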
  5. Pruning.

    1. From the maximal tree, print the results for the sequence of nested pruned subtrees (see the rpart.object help page).
    t2$cptable
    1. Find the complexity value corresponding to the minimum of the cross-validation error (column xerror).
    # select columns by name rather than by position
    ind_cpopt <- which.min(t2$cptable[, "xerror"])
    cpopt <- t2$cptable[ind_cpopt, "CP"]
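
    The xerror column is estimated by cross-validation (10 folds by default, see the xval parameter of rpart.control()), so its values, and possibly the selected complexity value, can change from one run to another. A possible way to make the selection reproducible, with an arbitrary seed chosen for this sketch, is to fix the random seed before building the maximal tree:

    ```r
    library("mlbench")
    library("rpart")
    data(Ozone)

    # arbitrary seed (an assumption of this sketch): fixes the cross-validation folds
    set.seed(2024)
    t2 <- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0)

    # row of the complexity table with the smallest cross-validation error
    cpt <- t2$cptable
    ind_cpopt <- which.min(cpt[, "xerror"])
    cpt[ind_cpopt, ]
    ```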
    1. Prune the maximal tree with the prune() function, using the complexity value found in the previous step.
    t_opt <- prune(t2, cp = cpopt)
    plot(t_opt)
    text(t_opt)
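
    As a final check, the pruned tree can be compared with the maximal one. The sketch below recomputes everything so that it is self-contained; the seed and the helpers n.leaves() and err() are assumptions made for this illustration. Because t_opt is a subtree of t2, it cannot have more leaves, and its empirical error on the learning set should not be smaller.

    ```r
    library("mlbench")
    library("rpart")
    data(Ozone)

    set.seed(2024)  # arbitrary seed, only for reproducible cross-validation
    t2 <- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0)
    ind_cpopt <- which.min(t2$cptable[, "xerror"])
    t_opt <- prune(t2, cp = t2$cptable[ind_cpopt, "CP"])

    # hypothetical helpers: number of leaves and empirical mean squared error
    n.leaves <- function(tree) sum(tree$frame$var == "<leaf>")
    obs <- !is.na(Ozone$V4)
    err <- function(tree) mean((Ozone$V4[obs] - predict(tree, Ozone)[obs])^2)

    c(leaves.max = n.leaves(t2), leaves.opt = n.leaves(t_opt))
    c(err.max = err(t2), err.opt = err(t_opt))
    ```

    The empirical error favours the maximal tree by construction, so it is the cross-validation error, not the empirical one, that justifies choosing the pruned tree.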