I have a 2.5M x 13 matrix that I am trying to aggregate by an ID variable. At first I tried ddply, but it blew up my memory. I then tried data.table, which works a lot faster:
mydata <- as.data.table(mydata)
setkey(mydata, ID)
agg <- mydata[, mutate(.SD, start = min(Date)), by = ID]
Now there's no memory problem, but it still takes around 4 hours to run on an Intel i5 2.50GHz with 4.0GB of RAM. The operating system is Windows 7, so no parallel computing.
What am I doing wrong?
You don't need mutate, just use start := min(Date). I believe that should speed it up a lot.
agg <- mydata[, start := min(Date), by = ID]
@konvas beat me to it, but you should be able to validate that := is faster:
##
library(data.table)
library(plyr)
library(microbenchmark)
##
t0 <- as.Date("2013-01-01")
Df <- data.frame(
  ID = sample(LETTERS, 500000, replace = TRUE),
  Date = t0 + sample((-100):100, 500000, replace = TRUE),
  stringsAsFactors = FALSE)
Dt1 <- data.table(Df)
setkeyv(Dt1,cols="ID")
Dt2 <- copy(Dt1)
##
f1 <- function() {
  Agg <- Dt1[, mutate(.SD, start = min(Date)), by = list(ID)]
}
f2 <- function() {
  Agg <- Dt2[, Start := min(Date), by = list(ID)]
}
##
Res <- microbenchmark(f1(), f2())
##
Unit: milliseconds
 expr      min       lq   median       uq      max neval
 f1() 25.08676 27.30188 28.22867 31.60754 63.97749   100
 f2() 10.48293 11.39930 13.25193 14.26284 47.80564   100
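To get a feel for the original problem size, you can time the same := call on a table closer to the question's 2.5M rows. This is only a rough sketch reusing t0 from above; the ID and Date columns here are stand-ins for the real data:
## build a larger table and time the grouped minimum by reference
n <- 2500000
DtBig <- data.table(
  ID = sample(LETTERS, n, replace = TRUE),
  Date = t0 + sample((-100):100, n, replace = TRUE))
setkeyv(DtBig, cols = "ID")
system.time(DtBig[, start := min(Date), by = ID])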