Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combining "by" and "on" to join and create summary column for data.table

Tags:

r

data.table

I have two datasets, one a detailed dataset of weight's and another that's supposed to be a summary dataset. I am trying to create the summary dataset by joining the detail dataset and aggregating, but it isn't working as expected.

Here's a sample code.

mytesta <- data.table(cola = c("a","b"), groupa = c(1,2))  # summary
mytestb <- data.table(groupa = c(1,1,1,1,2,2,2), weighta = c(10,20,30,25,15,30,10))  #detail

And this is my desired output.

   cola groupa weighta
1:    a      1      85
2:    b      2      55

What I tried to do is,

mytesta[mytestb, on = "groupa", weight_summary := sum(i.weighta), by = "groupa"]

The problem is that when by is used, the columns of the inner data.table disappear (for instance, mytesta[mytestb, on = "groupa", .SD, by = "groupa"]). Is there a way around this?

like image 891
Naumz Avatar asked Oct 18 '22 15:10

Naumz


2 Answers

I would do

mytesta[, v := mytestb[.SD, on=.(groupa), sum(weighta), by=.EACHI]$V1 ]

In a X[Y] join, we're looking up each row of Y in X.

So if the end goal is to create a new column in Y computed per row, we'll need a join Y[, v := X[Y, ...]] even though Y[X, v := ...] might seem more intuitive at first.

like image 98
Frank Avatar answered Oct 20 '22 09:10

Frank


Here is a solution where I first merge your two data.tables, then summarize.

tab = merge(mytesta, mytestb, by="groupa")
tab
#    groupa cola weighta
# 1:      1    a      10
# 2:      1    a      20
# 3:      1    a      30
# 4:      1    a      25
# 5:      2    b      15
# 6:      2    b      30
# 7:      2    b      10

res = tab[, list(weighta=sum(weighta)), by=list(cola, groupa)]
res
#    cola groupa weighta
# 1:    a      1      85
# 2:    b      2      55
like image 34
bdemarest Avatar answered Oct 20 '22 11:10

bdemarest