
Cartesian product with filter data.table

Tags: r, data.table

I'm trying to replace a Cartesian product produced by SQL with a data.table call. I have a large history of assets and values, and I need a subset of all combinations. Let's say I have a table T with columns [date, contract, value]. In SQL it looks like

SELECT a.date, a.contract, a.value, b.contract, b.value
FROM T a, T b
WHERE a.date = b.date AND a.contract <> b.contract AND a.value + b.value < 4

In R I have now the following

library(data.table)

n <- 1500
dt <- data.table(date     = rep(seq(Sys.Date() - n+1, Sys.Date(), by = "1 day"), 3),
                 contract = c(rep("a", n), rep("b", n), rep("c", n)),
                 value    = c(rep(1, n), rep(2, n), rep(3, n)))
setkey(dt, date)

dt[dt, allow.cartesian = TRUE][(contract != i.contract) & (value + i.value < 4)]

I believe that my solution creates all combinations first (in this case 13,500 rows) and then filters them (down to 3,000). SQL, however (and I might be wrong), joins only the matching subset and, more importantly, doesn't load all combinations into RAM. Any ideas on how to use data.table more efficiently?
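The row counts mentioned above can be checked with a short sketch. It rebuilds the same dt as in the question; the names full, kept, n_full and n_kept are illustrative and not part of the original post.

```r
library(data.table)

n <- 1500
dt <- data.table(date     = rep(seq(Sys.Date() - n + 1, Sys.Date(), by = "1 day"), 3),
                 contract = rep(c("a", "b", "c"), each = n),
                 value    = rep(c(1, 2, 3), each = n))
setkey(dt, date)

# Full self-join on date: 3 x 3 contract pairs for each of the n dates
full <- dt[dt, allow.cartesian = TRUE]
n_full <- nrow(full)

# Only the (a, b) and (b, a) pairs satisfy both conditions on each date
kept <- full[(contract != i.contract) & (value + i.value < 4)]
n_kept <- nrow(kept)

c(full = n_full, kept = n_kept)
```

The intermediate table full is what occupies memory here: it holds every pair before any of them is discarded.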

asked Nov 21 '14 11:11 by kismsu



1 Answer

Use the by = .EACHI feature. In data.table, joins and subsets are very closely linked; that is, a join is just another subset that uses a data.table in i instead of the usual integer, logical, or row-name index. They are designed this way with cases like this in mind.

Subset-based joins let you combine j-expressions and grouping operations with the join itself.
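As a minimal sketch of that idea, on a toy table rather than the question's data (small and lookup are hypothetical names introduced only for illustration): passing a data.table in i performs a join on the key, and adding by = .EACHI evaluates j once for each row of i.

```r
library(data.table)

# Toy keyed table and a lookup table (both hypothetical)
small  <- data.table(id = c("x", "x", "y", "z"), v = 1:4, key = "id")
lookup <- data.table(id = c("x", "z"))

# A data.table in i is a join on the key: returns the rows of small with id x or z
joined <- small[lookup]

# by = .EACHI evaluates j separately for each row of lookup
per_row <- small[lookup, .(total = sum(v)), by = .EACHI]
per_row
```

Here per_row has one row per row of lookup, with total computed only from the rows of small that matched it.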

require(data.table)
dt[dt, .SD[contract != i.contract & value + i.value < 4L], by = .EACHI, allow.cartesian = TRUE]

This is the idiomatic way (if you'd like to use the i.* columns only in the condition and not return them). However, .SD has not yet been optimised for this case, and evaluating the j-expression on .SD for each group is costly.

system.time(dt[dt, .SD[contract != i.contract & value + i.value < 4L], by = .EACHI, allow.cartesian = TRUE])
#    user  system elapsed 
#   2.874   0.020   2.983 

Some cases using .SD have already been optimised. Until the remaining cases are taken care of, you can work around it this way:

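dt[dt, {
    # rows of x (for the current row of i) that pass the filter
    idx = contract != i.contract & value + i.value < 4L
    list(contract   = contract[idx],
         value      = value[idx],
         # keep the i.* values only when at least one row matched
         i.contract = i.contract[any(idx)],
         i.value    = i.value[any(idx)])
}, by = .EACHI, allow.cartesian = TRUE]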

This takes 0.045 seconds, as opposed to 0.005 seconds for your method. However, by = .EACHI evaluates the j-expression separately for each row of i, so the full set of combinations is never held in memory at once, which makes it memory efficient. That's the trade-off you'll have to accept.
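To check that the two approaches return the same rows, a sketch along these lines may help. It rebuilds dt as in the question; res_filter, res_eachi, cols and same are illustrative names, not part of the original answer.

```r
library(data.table)

n <- 1500
dt <- data.table(date     = rep(seq(Sys.Date() - n + 1, Sys.Date(), by = "1 day"), 3),
                 contract = rep(c("a", "b", "c"), each = n),
                 value    = rep(c(1, 2, 3), each = n))
setkey(dt, date)

# Approach 1: build every combination, then filter
res_filter <- dt[dt, allow.cartesian = TRUE][(contract != i.contract) & (value + i.value < 4)]

# Approach 2: filter inside j, one row of i at a time
res_eachi <- dt[dt, {
    idx <- contract != i.contract & value + i.value < 4
    list(contract   = contract[idx],
         value      = value[idx],
         i.contract = i.contract[any(idx)],
         i.value    = i.value[any(idx)])
}, by = .EACHI, allow.cartesian = TRUE]

# Put both results in the same row order, then compare column by column
cols <- c("date", "contract", "value", "i.contract", "i.value")
a <- res_filter[order(date, contract, i.contract)]
b <- res_eachi[order(date, contract, i.contract)]
same <- nrow(a) == nrow(b) &&
    all(vapply(cols, function(col) identical(a[[col]], b[[col]]), logical(1)))
same
```

Wrapping either call in system.time() gives a rough idea of the speed difference on your own machine; the timings quoted above will not necessarily reproduce.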

answered Nov 15 '22 19:11 by Arun