This is surprisingly difficult, but I am trying to do what the title says, for example suppose I have a data table dat
and I am trying to calculate the cumulative sum in a new column (from the 1st and 3rd, when it appears in the 2nd) of whatever group appears in the second column.
dat = data.table(A=c(1,2,3,1,4,5,1,2,3),B=c(1,1,1,NA,1,NA,2,NA,2),C=c(1,12,24.2,251,2,1,2,3,-1))
dat[,cumsum:=0]
So the data look like
> dat
A B C
1: 1 1 1.0
2: 2 1 12.0
3: 3 1 24.2
4: 1 NA 251.0
5: 4 1 2.0
6: 5 NA 1.0
7: 1 2 2.0
8: 2 NA 3.0
9: 3 2 -1.0
I want the output to be this:
> dat
A B C cumsum
1: 1 1 1.0 1
2: 2 1 12.0 1
3: 3 1 24.2 1
4: 1 NA 251.0 0
5: 4 1 2.0 252
6: 5 NA 1.0 0
7: 1 2 2.0 12
8: 2 NA 3.0 0
9: 3 2 -1.0 15
Is there an efficient data table way to do this? I could do this with loops but this would be quite slow, and I feel this must be doable in a more scalable way but I'm stuck.
A possible approach to use non equi self join:
dat[, rn := .I]
dat[!is.na(B), cumsum := dat[.SD, on=.(A=B, rn<=rn), sum(x.C), by=.EACHI]$V1]
output:
A B C cumsum rn
1: 1 1 1.0 1 1
2: 2 1 12.0 1 2
3: 3 1 24.2 1 3
4: 1 NA 251.0 0 4
5: 4 1 2.0 252 5
6: 5 NA 1.0 0 6
7: 1 2 2.0 12 7
8: 2 NA 3.0 0 8
9: 3 2 -1.0 15 9
data:
dat = data.table(A=c(1,2,3,1,4,5,1,2,3),B=c(1,1,1,NA,1,NA,2,NA,2),C=c(1,12,24.2,251,2,1,2,3,-1))
dat[,cumsum:=0]
edit: adding another approach inspired by Frank's answer
dat = data.table(A=c(1,2,3,1,4,5,1,2,3),B=c(1,1,1,NA,1,NA,2,NA,2),C=c(1,12,24.2,251,2,1,2,3,-1))
dat[, rn := .I][, cs := cumsum(C), A]
dat[, cumsum := 0][
!is.na(B), cumsum := dat[.SD, on=.(A=B, rn), allow.cartesian=TRUE, roll=TRUE, x.cs]]
Same idea as @chinsoon's answer, but with a rolling join:
dat[, rn := .I]
mDT = dat[.(setdiff(B, NA)), on=.(A), .(rn, v = cumsum(C)), by=.EACHI]
dat[, cumsum := 0]
dat[!is.na(B), cumsum := mDT[.SD, on=.(A=B, rn), roll=TRUE, x.v]]
A B C cumsum rn
1: 1 1 1.0 1 1
2: 2 1 12.0 1 2
3: 3 1 24.2 1 3
4: 1 NA 251.0 0 4
5: 4 1 2.0 252 5
6: 5 NA 1.0 0 6
7: 1 2 2.0 12 7
8: 2 NA 3.0 0 8
9: 3 2 -1.0 15 9
For every value of B
, mDT
has corresponding rows of A
with row number and cumsum. We look up the latest value of the cumsum by rolling to the most recent row number.
Here's another rolling join approach suggested by @chinsoon:
dat[, rn := .I]
dat[, cs := cumsum(C), by=A]
dat[, cumsum := 0]
dat[ !is.na(B), cumsum := dat[.SD, on=.(A=B, rn), allow.cartesian=TRUE, roll=TRUE, x.cs]]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With