
R caret: predict returns fewer outputs than inputs

Tags: r, r-caret, rpart

I used caret to train an rpart model below.

library(caret)

# Hold out 20% of the rows, stratified on the outcome
trainIndex <- createDataPartition(d$Happiness, p = .8, list = FALSE)
dtrain <- d[trainIndex, ]
dtest <- d[-trainIndex, ]

# 10-fold CV, repeated 10 times
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fitRpart <- train(Happiness ~ ., data = dtrain, method = "rpart",
                  trControl = fitControl)
testRpart <- predict(fitRpart, newdata = dtest)

dtest contains 1296 observations, so I expected testRpart to produce a vector of length 1296. Instead it's 1077 long, i.e. 219 short.

When I ran the prediction on only the first 220 rows of dtest, I got a single predicted value back, so the output is consistently 219 short.

Why does this happen, and how can I get exactly one prediction per input row?

Edit: d can be loaded from here to reproduce the above.

Ricky asked Jun 07 '15 03:06




2 Answers

I downloaded your data and found what explains the discrepancy.

If you simply remove the rows with missing values from dtest, the number of predictions matches the number of input rows:

testRpart <- predict(fitRpart, newdata = na.omit(dtest))

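Note that nrow(na.omit(dtest)) is 1103 and length(testRpart) is also 1103 (the exact counts vary with the random split made by createDataPartition). So the rows that go missing from the output are the ones containing NA values, and you need a strategy for handling them. See ?predict.rpart and the options for the na.action parameter to choose the behaviour you want.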
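To check that missing values account for the gap in your own data, you can count the complete rows and, if needed, pad the shorter prediction vector back out to the original row order. This is a minimal base R sketch. The data frame dtest_toy and the preds vector are made-up stand-ins, not the asker's data or real model output.

```r
# Toy stand-in for dtest (hypothetical values, for illustration only)
dtest_toy <- data.frame(
  Happiness = c(3, 5, 4, 2, 5),
  Income    = c(10, NA, 30, 40, NA),
  Age       = c(25, 31, NA, 52, 47)
)

# TRUE for rows with no NA in any column
complete <- complete.cases(dtest_toy)

n_total    <- nrow(dtest_toy)        # all rows
n_complete <- sum(complete)          # rows that na.omit() keeps
n_dropped  <- n_total - n_complete   # rows that get no prediction

# na.omit() keeps exactly the complete rows
stopifnot(nrow(na.omit(dtest_toy)) == n_complete)

# Placeholder for predict(fit, newdata = na.omit(dtest_toy));
# these numbers are invented, one per complete row
preds <- c(3.1, 2.4)

# Pad the predictions so they line up with the original rows,
# leaving NA where a row was dropped
aligned <- rep(NA_real_, n_total)
aligned[complete] <- preds
```

With your real data, n_dropped should equal the shortfall you see in length(testRpart), and aligned can sit next to dtest row for row.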

Josh W. answered Oct 20 '22 01:10


Building on what Josh mentioned: if you need a prediction for every row when using predict.train from caret, pass na.action = na.pass:

testRpart <- predict(fitRpart, newdata = dtest, na.action = na.pass)
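For anyone who wants to see the difference without the original data, here is a rough sketch using the built-in airquality data set, which has NAs in Ozone and Solar.R. It assumes caret and rpart are installed. The names train_df, test_df, fit, pred_default and pred_all are just illustrative, and training on complete rows only is a simplification for the example, not a recommendation.

```r
library(caret)   # assumes caret and rpart are installed

set.seed(1)

# airquality ships with R; Ozone and Solar.R contain NAs
train_df <- na.omit(airquality)   # train on complete rows only
test_df  <- airquality            # predict on all rows, NAs included

fit <- train(Temp ~ ., data = train_df, method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# Default for predict.train is na.action = na.omit:
# rows with missing predictors are dropped before predicting
pred_default <- predict(fit, newdata = test_df)

# na.pass hands the NAs through to rpart, which can still route
# those rows down the tree, so every row gets a prediction
pred_all <- predict(fit, newdata = test_df, na.action = na.pass)

# Compare the output lengths with the input
length(pred_default) == sum(complete.cases(test_df))
length(pred_all) == nrow(test_df)
```

This works here because rpart can handle missing predictor values at prediction time. Other model types may fail or return NA for such rows, so check the underlying model before relying on na.pass.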

Note: moving this to a separate answer based on Ricky's comment on Josh's answer above for visibility.

davedgd answered Oct 20 '22 01:10