I have a data.table in R that I want to use with the caret package:
library(caret)  # provides createDataPartition()
set.seed(42)
trainingRows <- createDataPartition(DT$variable, p = 0.75, list = FALSE)
head(trainingRows)  # view the sampled row numbers
However, I am not able to select the rows from the data.table with this result. Instead, I had to convert it to a data.frame:
DT_df <- as.data.frame(DT)
DT_train <- DT_df[trainingRows, ]
dim(DT_train)
The data.table alternative, DT_train <- DT[.(trainingRows), ], requires a key to be set.
Is there a better option than converting to a data.frame?
Roll your own:
inTrain <- sample(MyDT[, .I], floor(MyDT[, .N] * .75))
Train <- MyDT[inTrain]
Test <- MyDT[-inTrain]
Or, with the caret function, you can just wrap trainingRows in c():
trainingRows<-createDataPartition(DT$variable, p=0.75, list=FALSE)
Train <- DT[c(trainingRows)]
Test <- DT[c(-trainingRows)]
===
Edit by Matt
Was going to add a comment, but too long.
I've seen sample(.I, ...) being used quite a bit recently. This is inefficient because it has to create the (potentially very long) .I vector, which is just 1:nrow(DT). This is such a common case that R's sample() doesn't need you to pass that vector. Just pass the length: sample(nrow(DT)) already returns exactly the same result without having to create .I. See ?sample.
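To illustrate the point about sample(), here is a minimal base R sketch. The names n, from_vector and from_length are made up for this example, and n stands in for nrow(DT).

```r
# n is a stand-in for nrow(DT); no data.table is needed to show the point.
n <- 1000L

# Sampling from an explicit index vector (what sample(.I, ...) amounts to).
set.seed(42)
from_vector <- sample(seq_len(n), floor(n * 0.75))

# Sampling by passing only the length.
set.seed(42)
from_length <- sample(n, floor(n * 0.75))

# With the same seed, both calls draw the same row numbers.
print(identical(from_vector, from_length))
```

The only difference is that the second call never has to materialise the 1:n vector first.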
Also, it's better to avoid variable name repetition wherever possible. More background here.
So instead of:
inTrain <- sample(MyDT[, .I], floor(MyDT[, .N] * .75))
I'd do:
inTrain <- MyDT[, sample(.N, floor(.N * .75))]
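Put together, a full split along these lines might look like the sketch below. It assumes the data.table package is installed; the toy table MyDT with columns id and x is invented for illustration.

```r
library(data.table)

# Toy table, made up for this example: 100 rows with an id and a value.
set.seed(42)
MyDT <- data.table(id = 1:100, x = rnorm(100))

# Draw 75% of the row numbers inside j, so MyDT is only named once.
inTrain <- MyDT[, sample(.N, floor(.N * .75))]

# Positive row numbers select the training rows, negative ones the rest.
Train <- MyDT[inTrain]
Test  <- MyDT[-inTrain]

print(c(train = nrow(Train), test = nrow(Test)))
```

Because inTrain is a plain integer vector, it can be used directly in i without setting a key.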
The reason is that createDataPartition (with list=FALSE) returns an integer matrix, i.e. a vector with two dimensions, whose second dimension can be dropped without loss.
You can simply drop that dimension of trainingRows as follows:
DT[trainingRows[,1]]
The c() function from Bruce Pucci's answer will drop the dimension too.
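A small base R sketch of what both approaches do. The matrix here is a hand-built stand-in for the createDataPartition output (an assumption, so the example runs without caret), and by_column and by_c are names made up for this illustration.

```r
# Stand-in for createDataPartition(..., list = FALSE):
# a one-column integer matrix of row numbers.
trainingRows <- matrix(c(1L, 3L, 4L, 7L), ncol = 1,
                       dimnames = list(NULL, "Resample1"))
print(dim(trainingRows))  # two dimensions: rows and one column

# Selecting the single column drops the second dimension.
by_column <- trainingRows[, 1]

# c() strips the dim attribute and gives the same plain vector.
by_c <- c(trainingRows)

print(is.null(dim(by_column)))
print(identical(by_column, by_c))
```

Either result is an ordinary integer vector, which is what data.table expects in i for row selection.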
This minor difference from data.frame was spotted a long time ago, and I recently made PR #1275 to fill that gap.