Why does mutate() by group take forever?

I have a 2.5M x 13 matrix that I am trying to aggregate by an ID variable. I first tried ddply, but it ran out of memory. I then tried data.table, which works a lot faster:

mydata <- as.data.table(mydata)
setkey(mydata, ID)

agg <- mydata[, mutate(.SD, start = min(Date)), by = ID]

Now there is no memory problem, but it has been running for more than 4 hours so far on an Intel i5 2.50GHz with 4.0GB of RAM. The operating system is Windows 7, so no parallel computing.

What am I doing wrong?

Guest3290 asked Dec 25 '22 06:12


2 Answers

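You don't need mutate; just use start := min(Date). mutate builds a modified copy of .SD for every group, whereas := adds the column to the existing table by reference, so I believe that should speed it up a lot.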

agg <- mydata[, start := min(Date), by = ID]
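
A minimal sketch of what := does, assuming a made-up six-row table (the names toy and untouched and all values are illustrative, not the asker's data). The column is added to the existing table in place, so the object on the left of <- is the same table rather than a new one. If the original has to stay unchanged, take an explicit copy() first.

```r
library(data.table)

# Toy data (illustrative only): three IDs with two dates each
toy <- data.table(
  ID   = c("a", "a", "b", "b", "c", "c"),
  Date = as.Date("2013-01-01") + c(5, 2, 9, 7, 0, 3)
)

# Explicit copy taken before toy is modified
untouched <- copy(toy)

# Add the per-group minimum by reference; toy itself is modified
agg <- toy[, start := min(Date), by = ID]

print(names(toy))        # toy now carries the new column
print(names(untouched))  # the copy taken earlier does not
```

The row order of toy is preserved, since := only attaches a column and does not reorder or collapse the groups.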
konvas answered Jan 10 '23 08:01


@konvas beat me to it, but you should be able to validate that := is faster:

##
library(data.table)
library(plyr)
library(microbenchmark)
##
t0 <- as.Date("2013-01-01")
Df <- data.frame(
  ID=sample(LETTERS,500000,replace=TRUE),
  Date=t0+sample((-100):100,500000,replace=TRUE),
  stringsAsFactors=FALSE)
Dt1 <- data.table(Df)
setkeyv(Dt1,cols="ID")
Dt2 <- copy(Dt1)
##
f1 <- function(){
  Agg <- Dt1[
    ,
    mutate(.SD,start = min(Date)),
    by = list(ID)]
}
f2 <- function(){
  Agg <- Dt2[
    ,
    "Start":=min(Date),
    by=list(ID)]
}
##
Res <- microbenchmark(
  f1(),f2()
)
Res
##
## Unit: milliseconds
##  expr      min       lq   median       uq      max neval
##  f1() 25.08676 27.30188 28.22867 31.60754 63.97749   100
##  f2() 10.48293 11.39930 13.25193 14.26284 47.80564   100
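
Timing aside, it may be worth confirming that the faster form returns the right values. The sketch below is one way to do that, assuming a small random table (chk, ref, same and the seed are illustrative names and choices, not part of the benchmark above). It compares the := result against a base R reference computed with ave(), which returns the group minimum aligned to each row.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
t0  <- as.Date("2013-01-01")
chk <- data.table(
  ID   = sample(LETTERS[1:5], 50, replace = TRUE),
  Date = t0 + sample((-100):100, 50, replace = TRUE)
)

# data.table approach: per-group minimum added by reference
chk[, Start := min(Date), by = ID]

# Base R reference: group minimum repeated for every row of the group
ref <- ave(as.numeric(chk$Date), chk$ID, FUN = min)

# TRUE when both approaches agree on every row
same <- all(as.numeric(chk$Start) == ref)
print(same)
```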
nrussell answered Jan 10 '23 07:01