This practical work is based on an initial document written by Boris Hejblum

Load the mixOmics package and liver.toxicity data:

library(mixOmics)
data("liver.toxicity")

As before, we want to predict the albumine level using gene expression data. Furthermore, we want to know what genes are the most related to the albumine level.

  1. A first PLS model
    Consult the spls() help page, and fit an sparse-PLS regression model, first using \(r=10\) (10 latent variables) and keeping \(5\) original variables for each component (see the keepX parameter of the spls() function).
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic$ALB.g.dL.
p <- ncol(X)
p
## [1] 3116
n <- nrow(X)
n
## [1] 64
r <- 10
spls_albu <- spls(X, Y, ncomp = r, mode = "regression", keepX = rep(5, r))
  1. Number of latent variables
    How many latent variables do we have to keep ?
    [Use the “Mfold” cross-validation instead of the “loo” one if computational time is too high]
    Try to inscrease the number of variables per component and study the effect on the choice of the number of components.
    [take for exemple 20 and then 50 variables per component]
CV_albu <- perf(spls_albu, validation = "Mfold")
plot(x = 1:r, y = CV_albu$MSEP, col = "blue", lwd = 2, xlab = "Nb Comp", ylab = "10-fold CV error", 
    type = "b", main = "Evolution the cross-validation error against the number or component")

# On this example with 5 original variables per component, it seems that
# only one component is sufficient. However we keep 2 axes to be able to do
# the different graphs in the sequel.
spls_albu_20 <- spls(X, Y, ncomp = r, mode = "regression", keepX = rep(20, r))
CV_albu_20 <- perf(spls_albu_20, validation = "Mfold")
plot(x = 1:r, y = CV_albu_20$MSEP, col = "blue", lwd = 2, xlab = "Nb Comp", 
    ylab = "10-fold CV error", type = "b", main = "Evolution the cross-validation error with 20 var/comp")

# Again, only one component is sufficient here.
spls_albu_50 <- spls(X, Y, ncomp = r, mode = "regression", keepX = rep(50, r))
CV_albu_50 <- perf(spls_albu_50, validation = "Mfold")
plot(x = 1:r, y = CV_albu_50$MSEP, col = "blue", lwd = 2, xlab = "Nb Comp", 
    ylab = "10-fold CV error", type = "b", main = "Evolution the cross-validation error with 50 var/comp")

# This time the minimum is reached around 7 components, but the
# interpretation will be much difficult whereas the gain in terms of error
# is kind of small. We see that the number of variables per component has a
# strong effect on the behavior of the evolution of the cross-validation
# error against the number of component.

# The actual tuning of those parameters (number of components and numbers of
# variables used per component) is not trivial. One sound strategy is to
# first tune the number of variables for the first component, then the
# number of variables for the second, and so on. And stop when the error is
# not decreasing anymore.
  1. sPLS regression characteristics and norm of loadings vector
    Fit the sPLS regression model with the previously chosen parameters.
    Look at the explained variance proportions.
    Print the first coordinates of loadings vectors of \(X\).
    Compute \(L_2\) and \(L_1\) norms of the first loading vector of \(X\).
    Compare with the \(L_2\) and \(L_1\) norms of the first loading vector obtained when 50 variables are kept.
    Compute the covariance between \(Y\) and the first latent variable of \(X\).
    Compare with the covariance between \(Y\) and the first latent variable of \(X\) obtained when 50 variables are kept.
    Give the names of the genes which have been selected to build the first component.
spls_albu_opt <- spls(X, Y, ncomp = 2, mode = "regression", keepX = rep(5, 2))
spls_albu_opt$explained_variance
## $X
##     comp 1     comp 2 
## 0.08176828 0.04459205 
## 
## $Y
##    comp 1    comp 2 
## 1.0000000 0.4720824
head(spls_albu_opt$loadings$X)
##             comp 1 comp 2
## A_43_P14555      0      0
## A_43_P22290      0      0
## A_43_P20792      0      0
## A_43_P21286      0      0
## A_43_P12995      0      0
## A_43_P15834      0      0
normL2 <- sum(spls_albu_opt$loadings$X[, 1]^2)
normL2
## [1] 1
normL1 <- sum(abs(spls_albu_opt$loadings$X[, 1]))
normL1
## [1] 1.679634
normL2_50 <- sum(spls_albu_50$loadings$X[, 1]^2)
normL1_50 <- sum(abs(spls_albu_50$loadings$X[, 1]))
normL1_50
## [1] 5.13827
cov(Y, spls_albu_opt$variates$X[, 1])
## [1] 0.3062909
cov(Y, spls_albu_50$variates$X[, 1])
## [1] 0.8276925
genes1 <- which(spls_albu_opt$loadings$X[, 1] != 0)
genes1
##  A_43_P11285 A_42_P678904 A_42_P505480  A_43_P10088 A_42_P591665 
##         2864         2865         2866         2867         2923
  1. Individuals representation
    Plot the projection of indivuals on the first two latent variables of \(X\), and add the measurement time and the paracetamol dose. Comment.
# Dose
plotIndiv(spls_albu_opt, comp = 1:2, ind.names = FALSE, rep.space = "X-variate", 
    group = liver.toxicity$treatment$Dose.Group, legend = TRUE, legend.title = "Dose")

# Time of measurement:
plotIndiv(spls_albu_opt, comp = 1:2, ind.names = FALSE, rep.space = "X-variate", 
    group = liver.toxicity$treatment$Time.Group, legend = TRUE, legend.title = "Time")