I'm trying to replace a Cartesian product produced by SQL with a data.table call. I have a large history of assets and values, and I need a subset of all combinations. Let's say I have a table T = [date, contract, value]. In SQL it looks like

SELECT a.date, a.contract, a.value, b.contract, b.value
FROM T a, T b
WHERE a.date = b.date AND a.contract <> b.contract AND a.value + b.value < 4
In R I currently have the following:

library(data.table)
n <- 1500
dt <- data.table(date = rep(seq(Sys.Date() - n + 1, Sys.Date(), by = "1 day"), 3),
                 contract = c(rep("a", n), rep("b", n), rep("c", n)),
                 value = c(rep(1, n), rep(2, n), rep(3, n)))
setkey(dt, date)
dt[dt, allow.cartesian = TRUE][(contract != i.contract) & (value + i.value < 4)]
I believe that my solution creates all combinations first (in this case 13,500 rows) and then filters them (down to 3,000). SQL, however (and I might be wrong), joins the subsets directly and, more importantly, doesn't load all combinations into RAM. Any ideas on how to use data.table more efficiently?
Use the by = .EACHI feature. In data.table, joins and subsets are very closely linked; i.e., a join is just another subset, using a data.table instead of the usual integer / logical / row names. They are designed this way with these cases in mind. Subset-based joins allow you to incorporate j-expressions and grouping operations together while joining.
require(data.table)
dt[dt, .SD[contract != i.contract & value + i.value < 4L], by = .EACHI, allow.cartesian = TRUE]
This is the idiomatic way (in case you'd like to use the i.* columns just for the condition, but not return them as well); however, .SD has not yet been optimised, and evaluating the j-expression on .SD for each group is costly.
system.time(dt[dt, .SD[contract != i.contract & value + i.value < 4L], by = .EACHI, allow.cartesian = TRUE])
#   user  system elapsed
#  2.874   0.020   2.983
Some cases using .SD have already been optimised. Until this case is taken care of, you can work around it this way:
dt[dt, {
    idx = contract != i.contract & value + i.value < 4L
    list(contract = contract[idx],
         value = value[idx],
         i.contract = i.contract[any(idx)],
         i.value = i.value[any(idx)])
}, by = .EACHI, allow.cartesian = TRUE]
And this takes 0.045 seconds, as opposed to 0.005 seconds with your method. But by = .EACHI evaluates the j-expression once per group, never materialising the full Cartesian product (and is therefore memory efficient). That's the trade-off you'll have to accept.