I'm trying to replace a Cartesian product produced by SQL with a data.table call. I have a large history of assets and values, and I need a subset of all combinations. Let's say I have a table T = [date, contract, value]. In SQL it looks like

SELECT a.date, a.contract, a.value, b.contract, b.value
FROM T a, T b
WHERE a.date = b.date AND a.contract <> b.contract AND a.value + b.value < 4
In R I currently have the following:

library(data.table)
n <- 1500
dt <- data.table(date = rep(seq(Sys.Date() - n + 1, Sys.Date(), by = "1 day"), 3),
                 contract = c(rep("a", n), rep("b", n), rep("c", n)),
                 value = c(rep(1, n), rep(2, n), rep(3, n)))
setkey(dt, date)
dt[dt, allow.cartesian = TRUE][(contract != i.contract) & (value + i.value < 4)]
I believe that my solution creates all combinations first (in this case 13,500 rows) and then filters them (down to 3,000). SQL, however (and I might be wrong), joins the subsets directly and, more importantly, doesn't load all combinations into RAM. Any ideas on how to use data.table more efficiently?
Use the by = .EACHI feature. In data.table, joins and subsets are very closely linked; i.e., a join is just another subset, using a data.table instead of the usual integer / logical / row names. They are designed this way with these cases in mind. Subset-based joins allow you to incorporate j-expressions and grouping operations together while joining.
require(data.table)
dt[dt, .SD[contract != i.contract & value + i.value < 4L], by = .EACHI, allow.cartesian = TRUE]
This is the idiomatic way (in case you'd like to use the i.* columns just for the condition, but not return them as well); however, .SD has not yet been optimised, and evaluating the j-expression on .SD for each group is costly.
system.time(dt[dt, .SD[contract != i.contract & value + i.value < 4L], by = .EACHI, allow.cartesian = TRUE])
#   user  system elapsed
#  2.874   0.020   2.983
Some cases using .SD have already been optimised. Until this case is taken care of, you can work around it this way:
dt[dt, {
    idx = contract != i.contract & value + i.value < 4L
    list(contract = contract[idx],
         value = value[idx],
         i.contract = i.contract[any(idx)],
         i.value = i.value[any(idx)])
}, by = .EACHI, allow.cartesian = TRUE]
And this takes 0.045 seconds, as opposed to 0.005 seconds with your method. But by = .EACHI evaluates the j-expression once per group, never materialising the full Cartesian product (and is therefore memory efficient). That's the trade-off you'll have to accept.