How to custom a model in CARET to perform PLS-[Classifer] two-step classificaton model?

Tags:

This question is a continuation of the same thread here. Below is a minimal working example taken from this book:

Wehrens R. Chemometrics with R multivariate data analysis in the natural sciences and life sciences. 1st edition. Heidelberg; New York: Springer. 2011. (page 250).

The example was taken from this book and its package ChemometricsWithR. It highlighted some pitfalls when modeling using cross-validation techniques.

The Aim:
A cross-validated methodology using the same set of repeated CV to perform a known strategy of PLS followed typically by LDA or cousins like logistic regression, SVM, C5.0, CART, with the spirit of caret package. So PLS would be needed every time before calling the waiting classifier in order to classify PLS score space instead of the observations themselves. The nearest approach in the caret package is doing PCA as a pre-processing step before modeling with any classifier. Below is a PLS-LDA procedure with only one cross-validation to test performance of the classifier, there was no 10-fold CV or any repetition. The code below was taken from the mentioned book but with some corrections otherwise throws error:

library(ChemometricsWithR)
data(prostate)
prostate.clmat <- classvec2classmat(prostate.type) # convert Y to a dummy var

odd <- seq(1, length(prostate.type), by = 2) # training
even <- seq(2, length(prostate.type), by = 2) # holdout test

prostate.pls <- plsr(prostate.clmat ~ prostate, ncomp = 16, validation = "CV", subset=odd)

Xtst <- scale(prostate[even,], center = colMeans(prostate[odd,]), scale = apply(prostate[odd,],2,sd))

tst.scores <- Xtst %*% prostate.pls$projection # scores for the waiting trained LDA to test

prostate.ldapls <- lda(scores(prostate.pls)[,1:16],prostate.type[odd]) # LDA for scores
table(predict(prostate.ldapls, new = tst.scores[,1:16])$class, prostate.type[even])

predictionTest <- predict(prostate.ldapls, new = tst.scores[,1:16])$class)

library(caret)    
confusionMatrix(data = predictionTest, reference= prostate.type[even]) # from caret

Output:

Confusion Matrix and Statistics

          Reference
Prediction bph control pca
   bph       4       1   9
   control   1      35   7
   pca      34       4  68

Overall Statistics

               Accuracy : 0.6564          
                 95% CI : (0.5781, 0.7289)
    No Information Rate : 0.5153          
    P-Value [Acc > NIR] : 0.0001874       

                  Kappa : 0.4072          
 Mcnemar's Test P-Value : 0.0015385       

Statistics by Class:

                     Class: bph Class: control Class: pca
Sensitivity             0.10256         0.8750     0.8095
Specificity             0.91935         0.9350     0.5190
Pos Pred Value          0.28571         0.8140     0.6415
Neg Pred Value          0.76510         0.9583     0.7193
Prevalence              0.23926         0.2454     0.5153
Detection Rate          0.02454         0.2147     0.4172
Detection Prevalence    0.08589         0.2638     0.6503
Balanced Accuracy       0.51096         0.9050     0.6643

However, the confusion matrix didn't match that in the book, anyway the code in the book did break, but this one here worked with me!

Notes:
Although this was only one CV, but the intention is to agree on this methodology first, sd and mean of the train set were applied on the test set, PLUS transformed into PLS scores based a specific number of PC ncomp. I want this to occur every round of the CV in the caret. If the methodology as code is correct here, then it can serve, may be, as a good start for a minimal work example while modifying the code of the caret package.

Side Notes:
It can be very messy with scaling and centering, I think some of the PLS functions in R do scaling internally, with or without centering, I am not sure, so building a custom model in caret should be handled with care to avoid both lack or multiple scalings or centerings (I am on my guards with these things).

Perils of multiple centering/scaling
The code below is just to show how multliple centering/scaling can change the data, only centering is shown here but the same problem with scaling applies too.

set.seed(1)
x <- rnorm(200, 2, 1)
xCentered1 <- scale(x, center=TRUE, scale=FALSE)
xCentered2 <- scale(xCentered1, center=TRUE, scale=FALSE)
xCentered3 <- scale(xCentered2, center=TRUE, scale=FALSE)
sapply (list(xNotCentered= x, xCentered1 = xCentered1, xCentered2 = xCentered2, xCentered3 = xCentered3), mean)

Output:

xNotCentered    xCentered1    xCentered2    xCentered3 
 2.035540e+00  1.897798e-16 -5.603699e-18 -5.332377e-18

Please drop a comment if I am missing something somewhere in this course. Thanks.

888

asked Jan 11 '14 22:01

doctorate

1 Answers

If you want to fit these types of models with caret, you would need to use the latest version on CRAN. The last update was created so that people can use non-standard models as they see fit.

My approach below is to jointly fit the PLS and other model (I used random forest in the example below) and tune them at the same time. So for each fold, a 2D grid of ncomp and mtry is used.

The "trick" is to attached the PLS loadings to the random forest object so that they can be used during prediction time. Here is the code that defines the model (classification only):

 modelInfo <- list(label = "PLS-RF",
              library = c("pls", "randomForest"),
              type = "Classification",
              parameters = data.frame(parameter = c('ncomp', 'mtry'),
                                      class = c("numeric", 'numeric'),
                                      label = c('#Components', 
                                                '#Randomly Selected Predictors')),
              grid = function(x, y, len = NULL) {
                grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
                            mtry = 1:len)
                grid <- subset(grid, mtry <= ncomp)
                },
              loop = NULL,
              fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                     ## First fit the pls model, generate the training set scores,
                     ## then attach what is needed to the random forest object to 
                     ## be used later
                     pre <- plsda(x, y, ncomp = param$ncomp)
                     scores <- pls:::predict.mvr(pre, x, type = "scores")
                     mod <- randomForest(scores, y, mtry = param$mtry, ...)
                     mod$projection <- pre$projection
                     mod
                   },
                   predict = function(modelFit, newdata, submodels = NULL) {       
                     scores <- as.matrix(newdata)  %*% modelFit$projection
                     predict(modelFit, scores)
                   },
                   prob = NULL,
                   varImp = NULL,
                   predictors = function(x, ...) rownames(x$projection),
                   levels = function(x) x$obsLevels,
                   sort = function(x) x[order(x[,1]),])

and here is the call to train:

 library(ChemometricsWithR)
 data(prostate)

 set.seed(1)
 inTrain <- createDataPartition(prostate.type, p = .90)
 trainX <-prostate[inTrain[[1]], ]
 trainY <- prostate.type[inTrain[[1]]]
 testX <-prostate[-inTrain[[1]], ]
 testY <- prostate.type[-inTrain[[1]]]

 ## These will take a while for these data
 set.seed(2)
 plsrf <- train(trainX, trainY, method = modelInfo,
                preProc = c("center", "scale"),
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))

 ## How does random forest do on its own?
 set.seed(2)
 rfOnly <- train(trainX, trainY, method = "rf",
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))

Just for kicks, I got:

 > getTrainPerf(plsrf)
   TrainAccuracy TrainKappa method
 1     0.7940423    0.65879 custom
 > getTrainPerf(rfOnly)
   TrainAccuracy TrainKappa method
 1     0.7794082  0.6205322     rf

and

 > postResample(predict(plsrf, testX), testY)
  Accuracy     Kappa 
 0.7741935 0.6226087 
 > postResample(predict(rfOnly, testX), testY)
  Accuracy     Kappa 
 0.9032258 0.8353982

Max

138

answered Sep 30 '22 06:09

topepo

Related questions
                            
                                R regex / gsub : extract part of pattern
                            
                                Extracting p-value from lapply list of glm fits
                            
                                cbind items from multiple lists recursively
                            
                                aggregation with data.table in R
                            
                                how to get a list of pvalues?
                            
                                R Shiny: not sure why ggplot is failing
                            
                                Problems formatting date into format "%Y-%m" [duplicate]
                            
                                R type provider and Ggplot2
                            
                                Pass string variable in R script to use it in SQL statement
                            
                                R neuralNet: "non-conformable arguments"
                            
                                Aggregate() with more functions -
                            
                                ggplot2 geom_line() to skip NA values
                            
                                R get objects' names from the list of objects
                            
                                R: Pearson correlation rcorr(x,y) [x=matrix, y=vector] ignores y
                            
                                Partial vector addition in R
                            
                                Applying a function row-wise for a dataset
                            
                                Count Pattern Matching in R
                            
                                R package Kohonen - how to plot hexagons instead of circles as in Matlab SOM toolbox?
                            
                                cforest prints empty tree
                            
                                Nested lapply() in a list?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to custom a model in CARET to perform PLS-[Classifer] two-step classificaton model?

Tags:

r

classification

r-caret

cross-validation

doctorate

People also ask

1 Answers

topepo

Recent Activity

Donate For Us