I'm trying to calculate a discounted cumulative sum in which the later values are worth more.
Let's say I have the following dataset:
library(data.table)
dt <- data.table( "year" = c(79,80,81,82,83), "value" = c(5,2,6,8,9))
> dt
year value
1: 79 5
2: 80 2
3: 81 6
4: 82 8
5: 83 9
And I want the following output:
> dt2
year value DCS
1: 79 5 5.0000
2: 80 2 6.5000
3: 81 6 11.8500
4: 82 8 18.6650
5: 83 9 25.7985
The discounted cumulative sum (DCS) is calculated by discounting the previous values at a 10% annual discount rate. So, for the second line, the DCS value is given by 2 + 5*(0.9)^1. For the third line, the DCS is 6 + (0.9)^1*2 + (0.9)^2*5, and so on.
Formally, the discounted sum formula is given by:
DCS_t = sum_{i=1}^{t} (0.9)^(t-i) * value_i
Finally, a data.table solution is preferable if possible.
Here are 2 other options.
1) Using Rcpp and then updating by reference in data.table:
library(Rcpp)
cppFunction("
NumericVector dcs(NumericVector x, double disc) {
int n = x.size();
NumericVector res(n);
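if (n == 0) return res;  // guard: empty input has no first element
res[0] = x[0];
for (int i=1; i<n; i++) {
res[i] = x[i] + res[i-1]*disc;  // current value plus discounted running total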
}
return res;
}")
dt[, DCS := dcs(value, 0.9)]
2) Or recursively in data.table:
s <- 0
dt[, dcs2 := {
s <- value + s*0.9
s
},
1L:nrow(dt)]
#or simply: s <- 0; dt[, dcs2 := s <- value + s*0.9, 1L:nrow(dt)]
output:
year value DCS dcs2
1: 79 5 5.0000 5.0000
2: 80 2 6.5000 6.5000
3: 81 6 11.8500 11.8500
4: 82 8 18.6650 18.6650
5: 83 9 25.7985 25.7985
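Both options rely on the same recurrence, DCS[t] = value[t] + 0.9 * DCS[t-1], which is equivalent to the sum in the question. As a rough sketch of that idea without compiling anything, here is a plain R loop; dcs_r is a hypothetical helper name used only for illustration, and it will be much slower than the Rcpp version on long vectors:

```r
# Plain-R sketch of the recurrence DCS[t] = value[t] + disc * DCS[t-1].
# dcs_r is a hypothetical helper, not part of data.table or Rcpp.
dcs_r <- function(x, disc) {
  res <- numeric(length(x))   # preallocate the result
  acc <- 0                    # running discounted total
  for (i in seq_along(x)) {
    acc <- x[i] + disc * acc  # discount the old total, add the new value
    res[i] <- acc
  }
  res
}

value <- c(5, 2, 6, 8, 9)
dcs_r(value, 0.9)
```

On the sample data this should reproduce the DCS column shown in the output above.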
edit: in response to comment about grouping:
dt <- data.table(ID=c(1,1,2,2), value=1:4)
dt[, {
n <- .N
s <- 0;
.SD[, {
s <- value + s*0.9;
s
},
1L:n]
},
ID]
output:
ID n V1
1: 1 1 1.0
2: 1 2 2.9
3: 2 1 3.0
4: 2 2 6.7
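If the nested .SD call feels heavy, a possibly simpler sketch for the grouped case is to run base R's Reduce inside each group via by. The table name dt_g is made up for this example, and as.numeric is there only so every group returns the same column type:

```r
library(data.table)

# Same toy data as in the grouped example above; dt_g is a hypothetical name.
dt_g <- data.table(ID = c(1, 1, 2, 2), value = 1:4)

# Within each ID, accumulate acc <- 0.9 * acc + v over the values in order.
dt_g[, DCS := as.numeric(
  Reduce(function(acc, v) 0.9 * acc + v, value, accumulate = TRUE)
), by = ID]

dt_g
```

This keeps the original columns and adds DCS by reference, with the running total restarting for each ID.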
Not a proper answer, just timings of the other answers. Hopefully this helps in choosing between the options:
library(data.table)
library(Rcpp)
set.seed(0L)
dt <- data.table(value = rpois(1e4, 100))
app_3 <- function(dt) {
m <- matrix(0, nrow = nrow(dt), ncol = nrow(dt))
v <- 0.9**(seq(nrow(dt)) - 1)
m[lower.tri(m, diag = TRUE)] <- unlist(sapply(rev(seq_along(v)), function(k) head(v, k)))
dt[, DCS3 := m %*% value]
}
system.time(
cppFunction("
NumericVector dcs(NumericVector x, double disc) {
int n = x.size();
NumericVector res(n);
res[0] = x[0];
for (int i=1; i<n; i++) {
res[i] += x[i] + res[i-1]*disc;
}
return res;
}"))
# user system elapsed
# 0.03 0.16 20.03
res <- bench::mark(time_unit="s",
app_1 = dt[, DCS1 := sapply(1:.N, function(k) sum(0.9**(k - 1:k)*head(value, k)))],
app_2 = dt[, DCS2 := dt[, Reduce(function(x, y) 0.9 * x + y, as.list(value), accumulate = TRUE)]],
app_3 = app_3(dt),
dt_rcpp = dt[, DCS4 := dcs(value, 0.9)],
dt_recursive = {s <- 0
dt[, DCS5 := {
s <- value + s*0.9
s
}, 1L:nrow(dt)]
},
min_time = 1
)
res
timings:
# A tibble: 5 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl> <int> <dbl> <dbl> <list> <list> <list> <list>
1 app_1 6.34 6.34 0.158 1.12GB 0.315 1 2 6.34 <df[,7] [10~ <df[,3] [5~ <bch:~ <tibble [~
2 app_2 0.0109 0.0123 71.3 612.34KB 21.8 72 22 1.01 <df[,7] [10~ <df[,3] [2~ <bch:~ <tibble [~
3 app_3 3.93 3.93 0.255 4.1GB 0.764 1 3 3.93 <df[,7] [10~ <df[,3] [2~ <bch:~ <tibble [~
4 dt_rcpp 0.000308 0.000337 2681. 195.46KB 6.01 2679 6 0.999 <df[,7] [10~ <df[,3] [2~ <bch:~ <tibble [~
5 dt_recursive 0.00939 0.00972 99.2 294.52KB 6.94 100 7 1.01 <df[,7] [10~ <df[,3] [3~ <bch:~ <tibble [~
Timings with 1e6 rows:
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl> <int> <dbl> <dbl> <list> <list> <list> <list>
1 app_2 1.52 1.52 0.659 53.5MB 6.59 1 10 1.52 <df[,5] [1,0~ <df[,3] [27~ <bch:~ <tibble [~
2 dt_rcpp 0.00731 0.00942 89.9 15.3MB 0.899 100 1 1.11 <df[,5] [1,0~ <df[,3] [20~ <bch:~ <tibble [~
3 dt_recursive 0.902 0.905 1.10 22.9MB 1.66 2 3 1.81 <df[,5] [1,0~ <df[,3] [4,~ <bch:~ <tibble [~
Created on 2020-05-15 by the reprex package (v0.3.0)
Maybe you can try the code below.
Follow the formula directly, using sum:
dt[,DCS:=sapply(1:.N,function(k) sum(0.9**(k-1:k)*head(value,k)))]
Use Reduce from base R:
dt[,Reduce(function(x,y) 0.9*x+y,as.list(value),accumulate = TRUE)]
Build a matrix m that gives convolution-like coefficients:
m <- matrix(0,nrow = nrow(dt),ncol = nrow(dt))
v <- 0.9**(seq(nrow(dt))-1)
m[lower.tri(m,diag = TRUE)] <- unlist(sapply(rev(seq_along(v)),function(k) head(v,k)))
or use shift to obtain the matrix m (thanks to @chinsoon12):
x <- 0L:(nrow(dt)-1L);
m <- t(do.call(cbind, shift(0.9^x, x, fill=0)))
dt[,DCS:=m%*%value]
> dt
year value DCS
1: 79 5 5.0000
2: 80 2 6.5000
3: 81 6 11.8500
4: 82 8 18.6650
5: 83 9 25.7985
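To make the "convolution-like coefficients" more concrete, here is a small base R sketch that builds the same lower-triangular matrix with outer; this construction is an alternative written for illustration, not the code used above. Entry (i, j) is 0.9^(i - j) when i >= j and 0 otherwise, so row i holds the discount weights for row i of the data:

```r
value <- c(5, 2, 6, 8, 9)
n <- length(value)

# Lower-triangular weights: m[i, j] = 0.9^(i - j) for i >= j, else 0.
m <- outer(seq_len(n), seq_len(n),
           function(i, j) ifelse(i >= j, 0.9^(i - j), 0))

# Each DCS entry is the weighted sum of the current and all earlier values.
dcs_m <- as.vector(m %*% value)
dcs_m
```

Since m has n^2 entries, memory grows quadratically with the number of rows, so this is best kept to small inputs.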