R data.table group by attribute of i table

Question

I want to use columns of the i table in a data.table join for both calculations and for grouping. There appears to be some limitation in the syntax for this. Can you suggest a cleaner way of doing this?

require(data.table)
set.seed(1)

Table 1

DT1 <- data.table(loc = c("L1","L2"), product = c("P1","P2","P3"), qty = runif(12))

Table 2

DT2 <- data.table(product = c("P1","P2","P3"), family = c("A","A","B"), price = c(5,7,10))

A straight join on the tables is fine: [not the issue here, but the requirement to use quoted column names in the on clause seems to be inconsistent for data.table]

DT1[DT2, on = "product"]
#    loc product       qty family price
# 1:  L1      P1 0.1297134      A     5
# 2:  L2      P1 0.2423550      A     5
# 3:  L1      P1 0.3421633      A     5
# 4:  L2      P1 0.6537663      A     5
# 5:  L2      P2 0.9822407      A     7
# 6:  L1      P2 0.8568853      A     7
# 7:  L2      P2 0.7062672      A     7
# 8:  L1      P2 0.9224086      A     7
# 9:  L1      P3 0.8267184      B    10
#10:  L2      P3 0.8408788      B    10
#11:  L1      P3 0.6212432      B    10
#12:  L2      P3 0.5363538      B    10

Calculation using columns of both tables is fine:

DT1[DT2, .(family, product, val = qty*price), on = "product"]
#    family product       val
# 1:      A      P1 0.6485671
# 2:      A      P1 1.2117750
# 3:      A      P1 1.7108164
# 4:      A      P1 3.2688313
# 5:      A      P2 6.8756851
# 6:      A      P2 5.9981971
# 7:      A      P2 4.9438704
# 8:      A      P2 6.4568599
# 9:      B      P3 8.2671841
#10:      B      P3 8.4087878
#11:      B      P3 6.2124323
#12:      B      P3 5.3635379

I can group and aggregate on .EACHI

DT1[DT2,.(family, product, val = sum(qty*price)), on = "product", by = .EACHI]
#   product family product      val
#1:      P1      A      P1  6.83999
#2:      P2      A      P1 24.27461
#3:      P3      B      P1 28.25194

But not using product

DT1[DT2,.(family, product, val = sum(qty*price)), on = "product", by = product]
#Error in `[.data.table`(DT1, DT2, .(family, product, val = sum(qty * price)),  : 
#object 'price' not found

In this case it ceases to find price on the i table.

The .EACHI is usable in this case, because the by element is a unique key to DT2.

However if I want to group by an attribute of DT2 I don't seem to be able to use the .EACHI reference. I have achieved what I want with the following:

DT1[DT2, .(family, product, val = qty*price), on = "product"][, .(sum(val)), by = family]
#   family       V1
#1:      A 31.11460
#2:      B 28.25194

Is this double-processing necessary or is there another piece of the syntax that I can use in this situation?

webb · Accepted Answer

I can tell you how I use R in production in my last two companies. I don't use the DT1[DT2] syntax to join because it is not explicit about the type of join, and it's counterintuitive in that the columns of DT1 come first in the result, but it is all the rows of DT2 that are preserved, i.e., a right join. Wrongly assuming it would be a left join with DT1 on the left led to production bugs from several developers. Instead, we now require developers to explicitly state inner vs left vs right vs full join. So instead, we use merge().

Instead of your:

DT1[DT2, .(family, product, val = qty*price), on = "product"][, .(sum(val)), by = family]
   family        V1
1:      A 20.053267
2:      B  9.928635

I would do

merge(DT1, DT2, by='product', all.x=F, all.y=T)[,.(val = sum(qty*price)), family]
   family       val
1:      A 20.053267
2:      B  9.928635

PS: I also tried with for(i in 1:20) DT1 = rbind(DT1,DT1) to make DT1 2^20 times as big, i.e., 12,582,912 rows, and couldn't find any significant difference in runtime for your code and mine. Mine has several fairly minor advantages: the type of join is explicit, merge() is familiar to data.frame users, and it's more concise.

R data.table group by attribute of i table

Tags:

r

data.table

sch56

1 Answers

webb

Recent Activity

Donate For Us

R data.table group by attribute of i table

Tags:

r

data.table

sch56

1 Answers

webb

Related questions

Recent Activity

Donate For Us