Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cumulative sum based on certain conditions

Tags:

r

dplyr

This is my data frame:

       X   Y  Date   Qty  CumSumA  CumSumB
    1  A   B   1/1     1        1        0
    2  A   A   1/1     2        3        2
    3  A   E   1/1     2        5        2
    4  B   A   1/1     1        1        1
    5  B   B   1/1     3        4        4
    6  B   C   1/1     2        6        4
    7  C   D   1/1     2        2        2
    8  C   E   1/1     4        6        2
    9  C   A   1/1     1        7        2
   10  A   C   1/2     2        2        0
   11  A   D   1/2     3        5        0
   12  A   E   1/2     2        7        0
   13  B   A   1/2     5        5        0
   14  B   B   1/2     1        6        1
   15  B   C   1/2     2        8        1
   16  C   D   1/2     2        2        4
   17  C   E   1/2     1        1        4
   18  C   A   1/2     3        4        4

I get the CumSumA column with

library(dplyr)
data <- data %>% 
        group_by(Date,X) %>% 
        mutate(CumSumA= cumsum(Qty)) 

How can I get CumSumB column such that it is the cumulative sum of Qty for all rows above that have (a) the same Date value and (b) the same row X value in column Y.

So for example, row 16 has X value C and Date value 1/2. I want to get the cumulative sum of Qty of all rows with Y value C and Date value 1/2. So this would be rows 10 plus 15, so CumSumB is 2 + 2 = 4.

Note there are over 140 unique variables for column X and Y.

like image 789
user5316628 Avatar asked Sep 09 '15 10:09

user5316628


1 Answers

This solution is build on data.table and a join with allow.cartesian=TRUE

require(data.table)
setDT(DT)

Creating a base data.table whose X column we gonna use later on.

DT_X <- DT[,.(X,Y, Date, indx = .I)]
setkey(DT_X, Date, X)

Dropping X and inserting an index in the original DT

DT[,`:=`(X=NULL, indy = .I)]
setkey(DT, Date, Y)

Joining the data if X = Y (with allow.cartesian=TRUE). Have a look at DT_join if you are curious. See Why does X[Y] join of data.tables not allow a full outer join, or a left join? why this is a join

DT_join <- DT_X[DT, allow.cartesian=TRUE]

indy<=indx is an identifier to only take the sum of "all rows above" as you put it.

DT_join[!is.na(Y), .(CumSumB=sum(Qty * (indy<=indx))), by=.(X,Y,Date)]

Edit (based on aosmith Answer): Instead of by=.(X,Y,Date) one could also use by=indx

Result:

    X Y Date CumSumB
 1: A B  1/1       0
 2: A A  1/1       2
 3: A E  1/1       2
 4: B A  1/1       1
 5: B B  1/1       4
 6: B C  1/1       4
 7: C D  1/1       2
 8: C E  1/1       2
 9: C A  1/1       2
10: A C  1/2       0
11: A D  1/2       0
12: A E  1/2       0
13: B A  1/2       0
14: B B  1/2       1
15: B C  1/2       1
16: C D  1/2       4
17: C E  1/2       4
18: C A  1/2       4
like image 180
Rentrop Avatar answered Sep 23 '22 00:09

Rentrop