Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Aggregate only successive rows in data.table

Tags:

r

data.table

I have a data.table with roughly 30 columns and 100 million rows. The data contain several blocks of rows where each row in a block has the same value in three particular columns I am interested in. Here is an illustrative example where the columns I am interested in are Time, Fruit, and Colour:

dt <- data.table(Time = c(100, rep(101, 4), rep(102, 2), 103:105), 
                   Ref = 1:10, 
                   Fruit = c(rep('banana', 2), 'apple', rep('banana', 2), 
                             rep('orange', 2), 'banana', rep('apple', 2)), 
                   Colour = c('green', 'yellow', 'red', rep('yellow', 2), 
                              rep('blue', 2), 'red', 'green', 'red'), 
                   Price = c(rep(1, 3), 2, 4, 3, 1, rep(5, 3)))
dt

#    Time Ref  Fruit Colour Price
# 1:  100   1 banana  green     1
# 2:  101   2 banana yellow     1
# 3:  101   3  apple    red     1
# 4:  101   4 banana yellow     2
# 5:  101   5 banana yellow     4
# 6:  102   6 orange   blue     3
# 7:  102   7 orange   blue     1
# 8:  103   8 banana    red     5
# 9:  104   9  apple  green     5
#10:  105  10  apple    red     5

This example contains two blocks. The first consists of the 101-banana-yellow rows 4 and 5, and the second consists of the 102-orange-blue rows 6 and 7. Note that even though row 2 matches rows 4 and 5 on Time, Fruit, and Colour, I do not want to include it as part of a block since row 3 is different from 2, 4, and 5, and breaks the chain of successive, matching rows.

Once I have found these blocks I want to combine the blocks in such a way that for most columns only the value from the last row in the block remains, while for other columns I want to sum up the values in all rows. In this example Ref should display the last value while Price should sum up, so my desired output is:

#    Time Ref  Fruit Colour Price
# 1:  100   1 banana  green     1
# 2:  101   2 banana yellow     1
# 3:  101   3  apple    red     1
# 4:  101   5 banana yellow     6
# 5:  102   7 orange   blue     4
# 6:  103   8 banana    red     5
# 7:  104   9  apple  green     5
# 8:  105  10  apple    red     5

I tried doing this with data.table's by feature, but I can't get the desired output:

byMethod <- dt[, list(Ref = tail(Ref, 1), Price = sum(Price)), by = list(Time, Fruit, Colour)]
setcolorder(byMethod, c('Time', 'Ref', 'Fruit', 'Colour', 'Price'))
byMethod

#    Time Ref  Fruit Colour Price
# 1:  100   1 banana  green     1
# 2:  101   5 banana yellow     7
# 3:  101   3  apple    red     1
# 4:  102   7 orange   blue     4
# 5:  103   8 banana    red     5
# 6:  104   9  apple  green     5
# 7 :  105  10  apple    red     5

This works fine with the 102-orange-blue block in the example, but it doesn't produce the result I want for the 101-banana-yellow block because it includes row 2 in this block when I don't want to.

Can anyone help me out here?

like image 546
aleksi Avatar asked Mar 21 '23 17:03

aleksi


1 Answers

Is this fast enough?

#create an index
dt[,i:=.I]
#group adjacent indices together
dt[, g:=cumsum(c(1, (diff(i) > 1))), by=list(Time, Fruit, Colour)]
#sum prices
dt[, list(Ref=tail(Ref, 1), Price=sum(Price)), 
   by=list(Time, Fruit, Colour, g)]

#    Time  Fruit Colour g Ref Price
# 1:  100 banana  green 1   1     1
# 2:  101 banana yellow 1   2     1
# 3:  101  apple    red 1   3     1
# 4:  101 banana yellow 2   5     6
# 5:  102 orange   blue 1   7     4
# 6:  103 banana    red 1   8     5
# 7:  104  apple  green 1   9     5
# 8:  105  apple    red 1  10     5
like image 192
Roland Avatar answered Mar 23 '23 11:03

Roland