Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R Data.Table Join on Conditionals

Tags:

r

data.table

I have two tables that I would like to join together in a way equivalent to the following SQL, where I join on multiple conditions, not just equality.

require(sqldf)
require(data.table)

dt <- data.table(num=c(1, 2, 3, 4, 5, 6), 
char=c('A', 'A', 'A', 'B', 'B', 'B'), 
bool=c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))

dt_two <- data.table(
num =c(6, 1, 5, 2, 4, 3), 
char=c('A', 'A', 'A', 'B', 'B', 'B'), 
bool=c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))


dt_out_sql <- sqldf('
    select dtone.num, dtone.char, dtone.bool, SUM(dttwo.num) as SUM,  
   MIN(dttwo.num) as MIN
    from dt as dtone INNER join dt_two as dttwo on 
    (dtone.char = dttwo.char) and 
    (dtone.num >= dttwo.num OR dtone.bool)
GROUP BY dtone.num, dtone.char, dtone.bool')

I would like to avoid the SQL solution, for both performance and flexibility reasons. The same goes for doing a cross join, and then filtering/aggregating -- it would create an intermediate table with lots of unnecessary records for me to filter out.

Thank you very much!

Update -- my initial example was done in haste. In my actual problem, I'm not doing a self join.

like image 510
Netbrian Avatar asked Feb 26 '15 16:02

Netbrian


2 Answers

Here's one way:

require(data.table)
setkey(dt, char)
setkey(dt_two, char)

dt_two[dt, {
   val = num[i.bool | i.num >= num]; 
   list(num=i.num, bool=i.bool, sum=sum(val), min=min(val))
}, by=.EACHI]
#    char num  bool sum min
# 1:    A   1  TRUE  12   1
# 2:    A   2 FALSE   1   1
# 3:    A   3  TRUE  12   1
# 4:    B   4 FALSE   9   2
# 5:    B   5  TRUE   9   2
# 6:    B   6 FALSE   9   2

To read about by=.EACHI, have a look at this post (until vignettes on joins are finished).

HTH

like image 186
Arun Avatar answered Nov 22 '22 07:11

Arun


It's a bit ugly but works:

library(data.table)
library(sqldf)

dt <- data.table(num=c(1, 2, 3, 4, 5, 6), 
                 char=c('A', 'A', 'A', 'B', 'B', 'B'), 
                 bool=c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))

dt_two <- data.table(
  num =c(6, 1, 5, 2, 4, 3), 
  char=c('A', 'A', 'A', 'B', 'B', 'B'), 
  bool=c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))


dt_out_sql <- sqldf('
    select dtone.num,
            dtone.char,
            dtone.bool,
            SUM(dttwo.num) as SUM,  
            MIN(dttwo.num) as MIN
    from    dt as dtone
    INNER join dt_two as dttwo on 
          (dtone.char = dttwo.char) and 
          (dtone.num >= dttwo.num OR dtone.bool)
    GROUP BY dtone.num, dtone.char, dtone.bool
  ')

setDT(dt_out_sql)

setkey(dt, char)
setkey(dt_two, char)

dt_out_r <- dt[dt_two,
               list(dtone.num = num,
                    dttwo.num = i.num,
                    char,
                    bool) ,
               nomatch = 0, allow.cartesian = T
               ][
                 dtone.num >= dttwo.num | bool,
                 list(SUM = sum(dttwo.num),
                      MIN = min(dttwo.num)),
                 by = list(num = dtone.num,
                           char,
                           bool)
                 ]

setkey(dt_out_r, num, char, bool)


all.equal(dt_out_sql, dt_out_r, check.attributes = FALSE)
like image 33
Michele Avatar answered Nov 22 '22 07:11

Michele