I want to use columns of the i table in a data.table join for both calculations and for grouping. There appears to be some limitation in the syntax for this. Can you suggest a cleaner way of doing this?
require(data.table)
set.seed(1)
Table 1
DT1 <- data.table(loc = c("L1","L2"), product = c("P1","P2","P3"), qty = runif(12))
Table 2
DT2 <- data.table(product = c("P1","P2","P3"), family = c("A","A","B"), price = c(5,7,10))
A straight join on the tables is fine: [not the issue here, but the requirement to use quoted column names in the on clause seems to be inconsistent for data.table]
DT1[DT2, on = "product"]
# loc product qty family price
# 1: L1 P1 0.1297134 A 5
# 2: L2 P1 0.2423550 A 5
# 3: L1 P1 0.3421633 A 5
# 4: L2 P1 0.6537663 A 5
# 5: L2 P2 0.9822407 A 7
# 6: L1 P2 0.8568853 A 7
# 7: L2 P2 0.7062672 A 7
# 8: L1 P2 0.9224086 A 7
# 9: L1 P3 0.8267184 B 10
#10: L2 P3 0.8408788 B 10
#11: L1 P3 0.6212432 B 10
#12: L2 P3 0.5363538 B 10
Calculation using columns of both tables is fine:
DT1[DT2, .(family, product, val = qty*price), on = "product"]
# family product val
# 1: A P1 0.6485671
# 2: A P1 1.2117750
# 3: A P1 1.7108164
# 4: A P1 3.2688313
# 5: A P2 6.8756851
# 6: A P2 5.9981971
# 7: A P2 4.9438704
# 8: A P2 6.4568599
# 9: B P3 8.2671841
#10: B P3 8.4087878
#11: B P3 6.2124323
#12: B P3 5.3635379
I can group and aggregate on .EACHI
DT1[DT2,.(family, product, val = sum(qty*price)), on = "product", by = .EACHI]
# product family product val
#1: P1 A P1 6.83999
#2: P2 A P1 24.27461
#3: P3 B P1 28.25194
But not using product
DT1[DT2,.(family, product, val = sum(qty*price)), on = "product", by = product]
#Error in `[.data.table`(DT1, DT2, .(family, product, val = sum(qty * price)), :
#object 'price' not found
In this case it ceases to find price on the i table.
The .EACHI is usable in this case, because the by element is a unique key to DT2.
However if I want to group by an attribute of DT2 I don't seem to be able to use the .EACHI reference. I have achieved what I want with the following:
DT1[DT2, .(family, product, val = qty*price), on = "product"][, .(sum(val)), by = family]
# family V1
#1: A 31.11460
#2: B 28.25194
Is this double-processing necessary or is there another piece of the syntax that I can use in this situation?
I can tell you how I use R in production in my last two companies. I don't use the DT1[DT2] syntax to join because it is not explicit about the type of join, and it's counterintuitive in that the columns of DT1 come first in the result, but it is all the rows of DT2 that are preserved, i.e., a right join. Wrongly assuming it would be a left join with DT1 on the left led to production bugs from several developers. Instead, we now require developers to explicitly state inner vs left vs right vs full join. So instead, we use merge().
Instead of your:
DT1[DT2, .(family, product, val = qty*price), on = "product"][, .(sum(val)), by = family]
family V1
1: A 20.053267
2: B 9.928635
I would do
merge(DT1, DT2, by='product', all.x=F, all.y=T)[,.(val = sum(qty*price)), family]
family val
1: A 20.053267
2: B 9.928635
PS: I also tried with for(i in 1:20) DT1 = rbind(DT1,DT1) to make DT1 2^20 times as big, i.e., 12,582,912 rows, and couldn't find any significant difference in runtime for your code and mine. Mine has several fairly minor advantages: the type of join is explicit, merge() is familiar to data.frame users, and it's more concise.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With