I want to add a new column to my data.table containing the cumulative product of Data1
based on the Date
. The cumulative product should be calculated for each category (Cat
) and should start with the latest available Date
.
Sample data:
DF = data.frame(Cat=rep(c("A","B"),each=4), Date=rep(c("01-08-2013","01-07-2013","01-04-2013","01-03-2013"),2), Data1=c(1:8))
DF$Date = as.Date(DF$Date , "%m-%d-%Y")
DT = data.table(DF)
DT[ , Data1_cum:=NA_real_]
DT
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 NA
2: A 2013-01-07 2 NA
3: A 2013-01-04 3 NA
4: A 2013-01-03 4 NA
5: B 2013-01-08 5 NA
6: B 2013-01-07 6 NA
7: B 2013-01-04 7 NA
8: B 2013-01-03 8 NA
The result should look like this:
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-07 2 2
3: A 2013-01-04 3 6
4: A 2013-01-03 4 24
5: B 2013-01-08 5 5
6: B 2013-01-07 6 30
7: B 2013-01-04 7 210
8: B 2013-01-03 8 1680
I figured out that I could do something similar using cumprod()
, but I do not know how to handle the categories. NA
s in Data1
should be ignored / treated as 1.
The real dataset has about 8 million rows and 1000 categories.
If the only looksissue is the ordering...
DT[order(Date, decreasing=TRUE), Data1_cum := cumprod(Data1), by=Cat]
DT
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-07 2 2
3: A 2013-01-04 3 6
4: A 2013-01-03 4 24
5: B 2013-01-08 5 5
6: B 2013-01-07 6 30
7: B 2013-01-04 7 210
8: B 2013-01-03 8 1680
Note: If you shuffle the order of the rows, your results can vary. Careful with how you implement the order(.)
command
## Let's add some NA values
DT <- rbind(DT, DT)
DT[c(2, 6, 11, 15), Data1 := NA]
# shuffle the rows, to make sure this is right
set.seed(1)
DT <- DT[sample(nrow(DT))]
Assigning the cumulative product:
## If you want to leave the NA's as NA's in the cum prod, use:
DT[ , Data1_cum := NA_real_ ]
DT[ intersect(order(Date, decreasing=TRUE), which(!is.na(Data1)))
, Data1_cum := cumprod(Data1)
, by=Cat]
# View the data, orderly
DT[order(Date, decreasing=TRUE)][order(Cat)]
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-08 1 1
3: A 2013-01-07 2 2
4: A 2013-01-07 NA NA <~~~~~~~ Note that the NA rows have the value of the prev row
5: A 2013-01-04 3 6
6: A 2013-01-04 NA NA <~~~~~~~ Note that the NA rows have the value of the prev row
7: A 2013-01-03 4 24
8: A 2013-01-03 4 96
9: B 2013-01-08 5 5
10: B 2013-01-08 5 25
11: B 2013-01-07 6 150
12: B 2013-01-07 NA NA <~~~~~~~ Note that the NA rows have the value of the prev row
13: B 2013-01-04 7 1050
14: B 2013-01-04 NA NA <~~~~~~~ Note that the NA rows have the value of the prev row
15: B 2013-01-03 8 8400
16: B 2013-01-03 8 67200
## If instead you want to treat the NA's as 1, use:
DT[order(Date, decreasing=TRUE), Data1_cum := {Data1[is.na(Data1)] <- 1; cumprod(Data1 [order(Date, decreasing=TRUE)] )}, by=Cat]
# View the data, orderly
DT[order(Date, decreasing=TRUE)][order(Cat)]
Cat Date Data1 Data1_cum
1: A 2013-01-08 1 1
2: A 2013-01-08 1 1
3: A 2013-01-07 2 2
4: A 2013-01-07 NA 2 <~~~~~~~ Rows with NA took on values of the previous Row
5: A 2013-01-04 3 6
6: A 2013-01-04 NA 6 <~~~~~~~ Rows with NA took on values of the previous Row
7: A 2013-01-03 4 24
8: A 2013-01-03 4 96
9: B 2013-01-08 5 5
10: B 2013-01-08 5 25
11: B 2013-01-07 6 150
12: B 2013-01-07 NA 150 <~~~~~~~ Rows with NA took on values of the previous Row
13: B 2013-01-04 7 1050
14: B 2013-01-04 NA 1050 <~~~~~~~ Rows with NA took on values of the previous Row
15: B 2013-01-03 8 8400
16: B 2013-01-03 8 67200
Alternatively, If you already have the cumulative product and simply want to remove the NA's you can do so as follows:
# fix the NA's with the previous value
DT[order(Date, decreasing=TRUE),
Data1_cum := {tmp <- c(0, head(Data1_cum, -1));
Data1_cum[is.na(Data1_cum)] <- tmp[is.na(Data1_cum)];
Data1_cum }
, by=Cat ]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With