Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

merge a data.table with itself after a reference lookup

Tags:

r

data.table

If I have the data.tables DT and neighbors:

set.seed(1)
library(data.table)
DT <- data.table(idx=rep(1:10, each=5), x=rnorm(50), y=letters[1:5], ok=rbinom(50, 1, 0.90))
n <- data.table(y=letters[1:5], y1=letters[c(2:5,1)])

n is a lookup table. Whenever ok == 0, I want to look up the corresponding y1 in n and use that value for x and the given idx. By way of example, row 4 of DT:

> DT
   idx          x y ok
1:   1 -0.6264538 a  1
2:   1  0.1836433 b  1
3:   1 -0.8356286 c  1
4:   1  1.5952808 d  0
5:   1  0.3295078 e  1
6:   2 -0.8204684 a  1

The y1 from n for d is e:

> n[y == 'd']
   y y1
1: d  e

and idx for row 4 is 1. So I would use:

> DT[idx == 1 & y == 'e', x]
[1] 0.3295078

I want my output to be a data.table just like DT[ok == 0] with all the x values replaced by their appropriate n['y1'] x value:

> output
   idx          x y ok
1:   1  0.3295078 d  0
2:   2 -0.3053884 d  0
3:   3  0.3898432 a  0
4:   5  0.7821363 a  0
5:   7  1.3586800 e  0
6:   8  0.7631757 d  0

I can think of a few ways of doing this with base R or with plyr... and maybe its late on Friday... but whatever the sequences of merges that this would require in data.table is beyond me!

like image 644
Justin Avatar asked Sep 14 '12 23:09

Justin


Video Answer


2 Answers

Great question. Using the functions in the other answers and wrapping Blue's answer into a function blue, how about the following. The benchmarks include the time to setkey in all cases.

red = function() {
    ans = DT[ok==0]
      # Faster than setkey(DT,ok)[J(0)] if the vector scan is just once
      # If lots of lookups to "ok" need to be done, then setkey may be worth it
      # If DT[,ok:=as.integer(ok)] can be done first, then ok==0L slightly faster

    # After extracting ans in the original order of DT, we can now set the key :
    setkey(DT,idx,y)
    setkey(n,y)

    # Now working with the reduced ans ...

    ans[,y1:=n[y,y1,mult="first"]]
    # Add a new column y1 by reference containing the lookup in n
    # mult="first" because we know n's key is unique, for speed (to save looking
    # for groups of matches in n). Future version of data.table won't need this.
    # Also, mult="first" has the advantage of dropping group columns (so we don't
    # need [[2L]]). mult="first"|"last" turns off by-without-by of mult="all".

    ans[,x:=DT[ans[,list(idx,y1)],x,mult="first"]]
    # Changes the contents of ans$x by reference. The ans[,list(idx,y1)] part is
    # how to pick the columns of ans to join to DT's key when they are not the key
    # columns of ans and not the first 1:n columns of ans. There is no need to key
    # ans, especially since that would change ans's order and not strictly answer
    # the question. If idx and y1 were columns 1 and 2 of (unkeyed) ans then we
    # wouldn't need that part, just
    #    ans[,x:=DT[ans,x,mult="first"]]
    # would do (relying on DT having 2 columns in its key). That has the advantage
    # of not copying the idx and y1 columns into a new data.table to pass as the i
    # DT. To save that copy y1 could be moved to column 2 using setcolorder first.

    redans <<- ans
    }


crdt(1e5)
origDT = copy(DT)
benchmark(blue={DT=copy(origDT); system.time(blue())},
          red={DT=copy(origDT); system.time(red())},
          fun={DT=copy(origDT); system.time(fun(DT,n))},
          replications=3, order="relative")

test replications elapsed relative user.self sys.self user.child sys.child
 red            3   1.107    1.000     1.100    0.004          0         0
blue            3   5.797    5.237     5.660    0.120          0         0
 fun            3   8.255    7.457     8.041    0.184          0         0

crdt(1e6)
[ .. snip .. ]
test replications elapsed relative user.self sys.self user.child sys.child
 red            3  14.647    1.000    14.613    0.000          0         0
blue            3  87.589    5.980    87.197    0.124          0         0
 fun            3 197.243   13.466   195.240    0.644          0         0

identical(blueans[,list(idx,x,y,ok,y1)],redans[order(idx,y1)])
# [1] TRUE

The order is needed in the identical because red returns the result in the same order as DT[ok==0] whereas blue appears to be ordered by y1 in the case of ties in idx.

If y1 is unwanted in the result it can be removed instantly (regardless of table size) using ans[,y1:=NULL]; i.e., this can be included above to produce the exact result requested in question, without affecting the timings at all.

like image 143
Matt Dowle Avatar answered Sep 19 '22 12:09

Matt Dowle


library(data.table)

crdt <- function(i=10){
 set.seed(1)
 DT <<- data.table(idx=rep(1:i, each=5), x=rnorm(5*i), 
                   y=letters[1:5], ok=rbinom(5*i, 1, 0.90))
 n <<- data.table(y=letters[1:5], y1=letters[c(2:5,1)])
} 

fun <- function(DT,n){
 setkey(DT,ok)
 n1 <- merge(n,DT[J(0),list(y,idx)],by="y")
 DT[J(0),x:=DT[paste0(y,idx) %in% paste0(n1[,y1],n1[,idx]),x]]
} 

crdt(10)
fun(DT,n)[J(0)]
     ok idx          x y
[1,]  0   1  0.3295078 d
[2,]  0   2 -0.3053884 d
[3,]  0   3  0.3898432 a
[4,]  0   5  0.7821363 a
[5,]  0   7  1.3586796 e
[6,]  0   8  0.7631757 d

But it is still pretty slow for bigger data.tables:

crdt(1e6)
system.time(fun(DT,n)[J(0)])
       User      System     elapsed 
      4.213       0.162       4.374 

crdt(1e7)
system.time(fun(DT,n)[J(0)])
       User      System     elapsed 
    195.685       3.949     199.592 

I'm interested to learn a faster solution.

like image 20
Roland Avatar answered Sep 18 '22 12:09

Roland