Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

merging endpoints of a range with a sequence

Tags:

r

data.table

plyr

In one of my application there is a piece of code that retrieve information from a data.table object depending on values in another.

# say this table contains customers details
dt <- data.table(id=LETTERS[1:4],
                 start=seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month"),
                 end=seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month") + c(6,8,10,5),
                 key="id")

# this one has some historical details
dt1 <- data.table(id=rep(LETTERS[1:4], each=120),
                  date=seq(as.Date("2010-01-01"), as.Date("2010-04-30"), "day"),
                  var=rnorm(120),
                  key="id,date")

# and here I finally retrieve my historical information based one customer detail
#
library(data.table)

myfunc <- function(x) {
  # some code
  period <- seq(x$start, x$end, "day")
  dt1[.(x$id, period)][, mean(var)]
  # some code
}

to get the result for all I use adply

library(plyr)
library(microbenchmark)
> adply(dt, 1, myfunc)
   id      start        end         V1
1:  A 2010-01-01 2010-01-07  0.3143536
2:  B 2010-02-01 2010-02-09 -0.5796084
3:  C 2010-03-01 2010-03-11  0.1171404
4:  D 2010-04-01 2010-04-06  0.2384237

> microbenchmark(adply(dt, 1, myfunc))
Unit: milliseconds
                 expr      min       lq   median       uq      max neval
 adply(dt, 1, myfunc) 8.812486 8.998338 9.105776 9.223637 88.14057   100

Do you know a way to avoid the adply call and do the above in one data.table statement? Or anyway a faster method? (title edit suggestion more than welcome, I couldn't think a better one, thanks)

like image 608
Michele Avatar asked Jul 11 '13 15:07

Michele


2 Answers

This is a great spot to use the roll argument of data.table:

setkey(dt1, id, date)
setkey(dt, id, start)

dt[dt1, roll = TRUE][end >= start,
   list(start = start[1], end = end[1], result = mean(var)), by = id]

# benchmark
microbenchmark(OP    = adply(dt, 1, myfunc),
               Frank = dt[dt1[as.list(dt[,seq.Date(start,end,"day"),by="id"])][,mean(var),by=id]],
               eddi  = dt[dt1, roll = TRUE][end >= start,list(start = start[1], end = end[1], result = mean(var)), by = id])
#Unit: milliseconds
#  expr       min        lq    median        uq       max neval
#    OP 24.436126 29.184786 30.853094 32.493521 50.898664   100
# Frank  9.115676 11.303691 12.081000 13.122753 28.370415   100
#  eddi  5.336315  6.323643  6.771898  7.497285  9.531376   100

The time difference will become much more dramatic as the size of the datasets grows.

like image 117
eddi Avatar answered Nov 15 '22 03:11

eddi


I can give you a bunch of nested [.data.table calls:

set.seed(1)
require(data.table)
# generate dt, dt1 as above
dt[
    dt1[
        as.list(dt[,seq.Date(start,end,"day"),by="id"])
    ][,mean(var),by=id]
]

#    id      start        end          V1
# 1:  A 2010-01-01 2010-01-07  0.04475859
# 2:  B 2010-02-01 2010-02-09 -0.01681972
# 3:  C 2010-03-01 2010-03-11  0.39791318
# 4:  D 2010-04-01 2010-04-06  0.77854732

I'm using as.list to unset the key. I wonder if there's a better way than this...

require(microbenchmark)
require(plyr)
microbenchmark(
    adply=adply(dt, 1, myfunc),
    dtdtdt= dt[dt1[as.list(dt[,seq.Date(start,end,"day"),by="id"])][,mean(var),by=id]]
)

# Unit: milliseconds
#    expr       min        lq    median        uq       max neval
#   adply 12.987334 13.247374 13.477386 14.371258 18.362505   100
#  dtdtdt  4.854708  4.944596  4.993678  5.233507  7.082461   100

EDIT: (eddi) Alternatives to the above that would require one less merge (as discussed in comments) are:

setkey(dt, NULL)

dt1[dt[, list(seq.Date(start,end,"day"), end), by=id]][,
    list(start = date[1], end = end[1], result = mean(var)), by = id]
# or
dt1[dt[, seq.Date(start,end,"day"), by=id]][,
    list(start = date[1], end = date[.N], result = mean(var)), by = id]
like image 24
Frank Avatar answered Nov 15 '22 02:11

Frank