I have a data.table with roughly 30 columns and 100 million rows. The data contain several blocks of rows where each row in a block has the same value in three particular columns I am interested in. Here is an illustrative example where the columns I am interested in are Time, Fruit, and Colour:
dt <- data.table(Time = c(100, rep(101, 4), rep(102, 2), 103:105),
Ref = 1:10,
Fruit = c(rep('banana', 2), 'apple', rep('banana', 2),
rep('orange', 2), 'banana', rep('apple', 2)),
Colour = c('green', 'yellow', 'red', rep('yellow', 2),
rep('blue', 2), 'red', 'green', 'red'),
Price = c(rep(1, 3), 2, 4, 3, 1, rep(5, 3)))
dt
# Time Ref Fruit Colour Price
# 1: 100 1 banana green 1
# 2: 101 2 banana yellow 1
# 3: 101 3 apple red 1
# 4: 101 4 banana yellow 2
# 5: 101 5 banana yellow 4
# 6: 102 6 orange blue 3
# 7: 102 7 orange blue 1
# 8: 103 8 banana red 5
# 9: 104 9 apple green 5
#10: 105 10 apple red 5
This example contains two blocks. The first consists of the 101-banana-yellow
rows 4 and 5, and the second consists of the 102-orange-blue
rows 6 and 7. Note that even though row 2 matches rows 4 and 5 on Time, Fruit, and Colour, I do not want to include it as part of a block since row 3 is different from 2, 4, and 5, and breaks the chain of successive, matching rows.
Once I have found these blocks I want to combine the blocks in such a way that for most columns only the value from the last row in the block remains, while for other columns I want to sum up the values in all rows. In this example Ref should display the last value while Price should sum up, so my desired output is:
# Time Ref Fruit Colour Price
# 1: 100 1 banana green 1
# 2: 101 2 banana yellow 1
# 3: 101 3 apple red 1
# 4: 101 5 banana yellow 6
# 5: 102 7 orange blue 4
# 6: 103 8 banana red 5
# 7: 104 9 apple green 5
# 8: 105 10 apple red 5
I tried doing this with data.table's by
feature, but I can't get the desired output:
byMethod <- dt[, list(Ref = tail(Ref, 1), Price = sum(Price)), by = list(Time, Fruit, Colour)]
setcolorder(byMethod, c('Time', 'Ref', 'Fruit', 'Colour', 'Price'))
byMethod
# Time Ref Fruit Colour Price
# 1: 100 1 banana green 1
# 2: 101 5 banana yellow 7
# 3: 101 3 apple red 1
# 4: 102 7 orange blue 4
# 5: 103 8 banana red 5
# 6: 104 9 apple green 5
# 7 : 105 10 apple red 5
This works fine with the 102-orange-blue
block in the example, but it doesn't produce the result I want for the 101-banana-yellow
block because it includes row 2 in this block when I don't want to.
Can anyone help me out here?
Is this fast enough?
#create an index
dt[,i:=.I]
#group adjacent indices together
dt[, g:=cumsum(c(1, (diff(i) > 1))), by=list(Time, Fruit, Colour)]
#sum prices
dt[, list(Ref=tail(Ref, 1), Price=sum(Price)),
by=list(Time, Fruit, Colour, g)]
# Time Fruit Colour g Ref Price
# 1: 100 banana green 1 1 1
# 2: 101 banana yellow 1 2 1
# 3: 101 apple red 1 3 1
# 4: 101 banana yellow 2 5 6
# 5: 102 orange blue 1 7 4
# 6: 103 banana red 1 8 5
# 7: 104 apple green 1 9 5
# 8: 105 apple red 1 10 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With