
Why do results using caret::train(..., method = "rpart") differ from rpart::rpart(...)?

Tags: r, r-caret, rpart

I'm taking part in the Coursera Practical Machine Learning course, and the coursework requires building predictive models using this dataset. I first split the data into training and testing sets, stratified on the outcome of interest (labelled y here, but actually the classe variable in the dataset):

library(caret)  # provides createDataPartition(), train() and confusionMatrix()
inTrain <- createDataPartition(y = data$y, p = 0.75, list = FALSE)
training <- data[inTrain, ]
testing <- data[-inTrain, ]

I have tried 2 different methods:

modFit <- caret::train(y ~ ., method = "rpart", data = training)
pred <- predict(modFit, newdata = testing)
confusionMatrix(pred, testing$y)

vs.

modFit <- rpart::rpart(y ~ ., data = training)
pred <- predict(modFit, newdata = testing, type = "class")
confusionMatrix(pred, testing$y)

I assumed they would give identical or very similar results, since the first method loads the 'rpart' package (which suggests to me that caret uses this package to fit the model). However, the timings (caret is much slower) and the results are very different:

Method 1 (caret):

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1264  374  403  357  118
         B   25  324   28  146  124
         C  105  251  424  301  241
         D    0    0    0    0    0
         E    1    0    0    0  418

Method 2 (rpart):

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1288  176   14   79   25
         B   36  569   79   32   68
         C   31   88  690  121  113
         D   14   66   52  523   44
         E   26   50   20   49  651

As you can see, the second approach is a better classifier - the first method is very poor for classes D & E.

I realise this may not be the most appropriate place to ask this question, but I would really appreciate a greater understanding of this and related issues. caret seems like a great package to unify the methods and call syntax, but I'm now hesitant to use it.

Jonny asked Mar 20 '15 13:03


1 Answer

caret actually does quite a bit more under the hood. In particular, it uses resampling to tune the model hyperparameters. In your case, it tries three values of cp (print modFit and you'll see accuracy results for each value) and keeps the one with the best resampled accuracy, whereas rpart just uses cp = 0.01 unless you tell it otherwise (see ?rpart.control). The resampling also explains the longer run time: by default caret refits the model on 25 bootstrap samples for each candidate value rather than fitting it once.
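To see how much cp alone can change a tree, here is a minimal sketch using only rpart on the built-in iris data (not the course dataset). The value cp = 0.6 is an arbitrary choice for illustration and is not a value caret would necessarily try:

```r
library(rpart)

# Default settings: cp = 0.01 (see ?rpart.control).
fit_default <- rpart(Species ~ ., data = iris)

# A much larger cp (0.6, illustrative only): splits that do not improve
# the overall fit by this factor are not attempted, so the tree stays tiny.
fit_large_cp <- rpart(Species ~ ., data = iris,
                      control = rpart.control(cp = 0.6))

# Number of nodes (internal + leaves) in each tree
nrow(fit_default$frame)
nrow(fit_large_cp$frame)

# Classes each tree actually predicts on the training data
table(predict(fit_default, type = "class"))
table(predict(fit_large_cp, type = "class"))
```

A tree grown with a larger cp has fewer leaves, so it may never predict some classes at all. That is one plausible reason a class such as D could receive zero predictions in the caret run.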

In order to get similar results, you need to disable resampling and specify cp:

modFit <- caret::train(y ~ ., method = "rpart", data = training,
                       trControl=trainControl(method="none"),
                       tuneGrid=data.frame(cp=0.01))

In addition, you should use the same random seed for both models.
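Putting these pieces together, the comparison can be sketched end to end. This is only an illustration under stated assumptions: it uses the built-in iris data in place of the course dataset, and the seed value 42 and the object names are made up for the example:

```r
library(caret)
library(rpart)

# Fix the seed before anything random happens (here: the data split).
set.seed(42)
inTrain  <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# caret with resampling switched off and cp pinned to rpart's default
fit_caret <- train(Species ~ ., data = training, method = "rpart",
                   trControl = trainControl(method = "none"),
                   tuneGrid = data.frame(cp = 0.01))

# Plain rpart with its default control settings (cp = 0.01)
fit_rpart <- rpart(Species ~ ., data = training)

pred_caret <- predict(fit_caret, newdata = testing)
pred_rpart <- predict(fit_rpart, newdata = testing, type = "class")

# Share of test rows on which the two models agree
mean(pred_caret == pred_rpart)
```

If the two fits really are configured the same way, the agreement should be at or very close to 1. With factor predictors the formula interface of train() expands them into dummy variables, so small differences from rpart are still possible on other datasets.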

That said, the extra functionality that caret provides is a Good Thing, and you should probably just go with caret. If you want to learn more, it's well-documented, and the author has a stellar book, Applied Predictive Modeling.

Peyton answered Sep 28 '22 02:09