I have a 2.5M x 13 matrix that I am trying to aggregate by an ID variable. At first I tried ddply, but it blew up my memory. I then tried data.table, which works a lot faster:
mydata <- as.data.table(mydata)
setkey(mydata, ID)
agg <- mydata[, mutate(.SD, start = min(Date)), by = ID]
Now there's no memory problem, but it still takes around 4 hours to run on an Intel i5 2.50GHz with 4.0GB of RAM. The operating system is Windows 7, so no parallel computing.
What am I doing wrong?
You don't need mutate, just use start := min(Date). I believe that should speed it up a lot.
agg <- mydata[, start := min(Date), by = ID]
@konvas beat me to it, but you should be able to validate that := is faster:
##
library(data.table)
library(plyr)
library(microbenchmark)
##
t0 <- as.Date("2013-01-01")
Df <- data.frame(
  ID = sample(LETTERS, 500000, replace = TRUE),
  Date = t0 + sample((-100):100, 500000, replace = TRUE),
  stringsAsFactors = FALSE)
Dt1 <- data.table(Df)
setkeyv(Dt1,cols="ID")
Dt2 <- copy(Dt1)
##
f1 <- function() {
  Agg <- Dt1[, mutate(.SD, start = min(Date)), by = list(ID)]
}
f2 <- function() {
  Agg <- Dt2[, Start := min(Date), by = list(ID)]
}
##
Res <- microbenchmark(f1(), f2())
##
Unit: milliseconds
 expr      min       lq   median       uq      max neval
 f1() 25.08676 27.30188 28.22867 31.60754 63.97749   100
 f2() 10.48293 11.39930 13.25193 14.26284 47.80564   100
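To get a feel for the original problem size, you can time the same := call on a table closer to the question's 2.5M rows. This is only a rough sketch reusing t0 from above; the ID and Date columns here are stand-ins for the real data:
## build a larger table and time the grouped minimum by reference
n <- 2500000
DtBig <- data.table(
  ID = sample(LETTERS, n, replace = TRUE),
  Date = t0 + sample((-100):100, n, replace = TRUE))
setkeyv(DtBig, cols = "ID")
system.time(DtBig[, start := min(Date), by = ID])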