R caret package (rpart): constructing a classification tree

Tags:

r

r-caret

rpart

I have been struggling for several days to build a classification tree with the caret package. The problem is my factor variables. I can fit the tree, but when I try to use the best model to make predictions on the test sample, it fails: the train function creates dummy variables for my factors, and the predict function then cannot find these newly created dummies in the test set. How should I deal with this problem?

My code is as follows:

install.packages("caret", dependencies = c("Depends", "Suggests"))
library(caret)

# Read the data; "?" marks a missing value
db <- read.csv("db.csv", header = TRUE, sep = ";", na.strings = "?")
fix(db)

# Recode the outcome as a No/Yes factor (0 becomes "No")
db$defaillance <- factor(db$defaillance)
db$def <- factor(ifelse(db$defaillance == 0, "No", "Yes"))
db$defaillance <- NULL

# Treat the categorical predictors as factors
db$canal <- factor(db$canal)
db$sect_isodev <- factor(db$sect_isodev)
db$sect_risq <- factor(db$sect_risq)

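# Delete near-zero-variance predictors (column 78 is left out of the check)
nzv <- nearZeroVar(db[, -78])
# Guard against an empty result: db[, -integer(0)] would drop every column
db_new <- if (length(nzv) > 0) db[, -nzv] else db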

inTrain <- createDataPartition(y = db_new$def, p = .75, list = FALSE)                               
training <- db_new[inTrain,]
testing <- db_new[-inTrain,]
str(training)
str(testing)
dim(training)
dim(testing)

A sample of the str() output for training/testing is shown below:

 $ FDR        : num  1305 211 162 131 143 ...
 $ FCYC       : num  0.269 0.18 0.154 0.119 0.139 ...
 $ BFDR       : num  803 164 108 72 76 63 100 152 188 80 ...
 $ TRES       : num  502 47 54 59 67 49 53 -7 -103 -109 ...
 $ sect_isodev: Factor w/ 9 levels "1","2","3","4",..: 4 3 3 3 3 3 3 3 3 3 ...
 $ sect_risq  : Factor w/ 6 levels "0","1","2","3",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ def        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
> dim(training)
[1] 14553    42
> dim(testing)
[1] 4850   42

Then my code goes like this:

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

#CART1
set.seed(1234)
tree1 <- train(def ~ .,
               data = training,
               method = "rpart",
               tuneLength = 20,
               metric = "ROC",
               trControl = fitControl)

A sample of the output of

summary(tree1$finalModel)

is shown here:

RNTB          38.397731
sect_isodev1   6.742289
sect_isodev3   4.005016
sect_isodev8   2.520850
sect_risq3     9.909127
sect_risq4     6.737908
sect_risq5     3.085714
SOLV          73.067539
TRES          47.906884
sect_isodev2   0.000000
sect_isodev4   0.000000
sect_isodev5   0.000000
sect_isodev6   0.000000
sect_isodev7   0.000000
sect_isodev9   0.000000
sect_risq0     0.000000
sect_risq1     0.000000
sect_risq2     0.000000

And here is the error:

model.tree1 <- predict(tree1$finalModel,testing)
Error in eval(expr, envir, enclos) : object 'sect_isodev1' not found

I am also curious about another thing. In Max Kuhn's "Predictive Modelling with R" I found the following syntax:

predict(rpartTune$finalModel, newdata, type = "class")

where rpartTune$finalModel is a classification tree identical to mine (or mine identical to his). Yet R doesn't accept type = "class", only type = "prob", and that troubles me.

Thank you in advance for your responses

asked Dec 18 '14 17:12

lorelai

1 Answer

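Don't use predict.rpart with train$finalModel unless you have a really good reason. The rpart object doesn't know about anything that train did, including pre-processing, so it may not give you the correct answer. Here, the formula interface of train converted the factors to dummy variables before fitting, which is why the rpart object looks for columns such as sect_isodev1 that do not exist in the test set. After all, you are presumably using train in order to avoid the minutiae, so let predict.train do the work: call predict on the train object itself.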
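
A minimal sketch of that advice, using made-up data rather than the data from the question. The toy data frame and the names x1, sector, tree_fit, pred_class and pred_prob are illustrative assumptions, and the resampling settings are simplified to keep the example quick:

```r
library(caret)

set.seed(1234)

# Hypothetical toy data: one numeric and one factor predictor, two-class outcome
n <- 400
toy <- data.frame(
  x1     = rnorm(n),
  sector = factor(sample(c("1", "2", "3"), n, replace = TRUE))
)
score <- toy$x1 + (toy$sector == "3") + rnorm(n, sd = 0.5)
toy$def <- factor(ifelse(score > 0.5, "Yes", "No"))

# Expanding a formula turns the factor into indicator columns,
# which is where names such as sector2 and sector3 come from
colnames(model.matrix(def ~ ., data = toy))

inTrain  <- createDataPartition(y = toy$def, p = 0.75, list = FALSE)
training <- toy[inTrain, ]
testing  <- toy[-inTrain, ]

fitControl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
tree_fit <- train(def ~ ., data = training, method = "rpart",
                  tuneLength = 5, trControl = fitControl)

# Predict with the train object, not with tree_fit$finalModel,
# so that the raw factor columns in the test set are handled by caret
pred_class <- predict(tree_fit, newdata = testing)                 # factor of classes
pred_prob  <- predict(tree_fit, newdata = testing, type = "prob")  # data frame of probabilities

table(predicted = pred_class, observed = testing$def)
```

Applied to the question, the equivalent call would be predict(tree1, newdata = testing).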

Max

EDIT: about the type = "class" and type = "prob" bit.

predict.rpart defaults to producing class probabilities. Although rpart is one of the earliest packages, that is atypical as most produce classes by default.

predict.train produces the classes by default and you have to use type = "prob" to get probabilities.
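
The difference on the rpart side can be seen in a short sketch that needs only the rpart package and its bundled kyphosis data (the object names tree, prob_default and class_pred are illustrative). As far as I know, predict.train names its default type = "raw" rather than type = "class", which would explain why type = "class" is rejected when predict is called on a train object:

```r
library(rpart)

# kyphosis ships with rpart; Kyphosis is a two-level factor (absent/present)
tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Default for a classification tree: a matrix of class probabilities
prob_default <- predict(tree, newdata = kyphosis)
head(prob_default)

# Predicted classes have to be requested explicitly
class_pred <- predict(tree, newdata = kyphosis, type = "class")
head(class_pred)
```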

answered Nov 11 '22 01:11

topepo
