
Discounted Cumulative Sum in R

I'm trying to calculate a discounted cumulative sum in which the later values are worth more.

Let's say I have the following dataset:

library(data.table)
dt <- data.table(year = c(79, 80, 81, 82, 83), value = c(5, 2, 6, 8, 9))

> dt
   year value
1:   79     5
2:   80     2
3:   81     6
4:   82     8
5:   83     9

And I want the following output:

> dt2
   year value     DCS
1:   79     5  5.0000
2:   80     2  6.5000
3:   81     6 11.8500
4:   82     8 18.6650
5:   83     9 25.7985

The discounted cumulative sum (DCS) is calculated by discounting the previous values at a 10% annual discount rate. So, for the second line, the DCS value is given by 2 + 5*(0.9)^1. For the third line, the DCS is 6 + (0.9)^1*2 + (0.9)^2*5, and so on.
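To make the arithmetic concrete, here is a minimal base R sketch that applies the description literally, row by row. The names `rate` and `dcs_direct` are only illustrative, and the 0.9 factor assumes the 10% discount rate stated above.

```r
value <- c(5, 2, 6, 8, 9)
rate <- 0.9  # discount factor, i.e. 1 minus the 10% annual discount rate

# For row k: value[k] gets weight 1, value[k-1] gets 0.9, value[k-2] gets 0.9^2, ...
dcs_direct <- sapply(seq_along(value), function(k) {
  idx <- seq_len(k)
  sum(rate^(k - idx) * value[idx])
})

print(dcs_direct)
```

Up to floating-point rounding, this should reproduce the DCS column of the desired output (5, 6.5, 11.85, 18.665, 25.7985).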

Formally, the discounted sum formula is given by:

$$\mathrm{DCS}_t = \sum_{i=1}^{t} 0.9^{\,t-i}\,\mathrm{value}_i$$

Finally, a data.table solution is preferable if possible.

lovestacksflow asked May 14 '20 15:05



3 Answers

Here are 2 other options.

1) using Rcpp and then update by reference in data.table:

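library(Rcpp)
cppFunction("
NumericVector dcs(NumericVector x, double disc) {
    int n = x.size();
    NumericVector res(n);
    if (n == 0) return res;  // nothing to accumulate for empty input
    res[0] = x[0];
    for (int i = 1; i < n; i++) {
        // current value plus the discounted running total
        res[i] = x[i] + res[i-1] * disc;
    }
    return res;
}")
dt[, DCS := dcs(value, 0.9)]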

2) Or recursively in data.table:

s <- 0
dt[, dcs2 := {
       s <- value + s*0.9
       s
    }, 
    1L:nrow(dt)]

#or simply: s <- 0; dt[, dcs2 := s <- value + s*0.9, 1L:nrow(dt)]

output:

   year value     DCS    dcs2
1:   79     5  5.0000  5.0000
2:   80     2  6.5000  6.5000
3:   81     6 11.8500 11.8500
4:   82     8 18.6650 18.6650
5:   83     9 25.7985 25.7985
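
Both options rely on the same recurrence: each result is the current value plus 0.9 times the previous result. As a hedged cross-check in base R (not part of either option above), the sketch below writes that recurrence as a plain loop and compares it with `stats::filter` using `method = "recursive"`, which applies the same update. The names `dcs_loop` and `dcs_filter` are illustrative only.

```r
value <- c(5, 2, 6, 8, 9)

# Plain loop: carry the discounted running total forward one row at a time.
dcs_loop <- numeric(length(value))
acc <- 0
for (i in seq_along(value)) {
  acc <- value[i] + 0.9 * acc
  dcs_loop[i] <- acc
}

# A recursive filter computes y[i] = x[i] + 0.9 * y[i - 1], starting from zero.
dcs_filter <- as.numeric(stats::filter(value, filter = 0.9, method = "recursive"))

# The two results should agree up to floating-point rounding.
stopifnot(isTRUE(all.equal(dcs_loop, dcs_filter)))
```

Because the recurrence only needs the previous result, it runs in a single pass over the data, which is why the loop-style solutions scale linearly with the number of rows.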

edit: in response to comment about grouping:

dt <- data.table(ID=c(1,1,2,2), value=1:4)
dt[, {
    n <- .N
    s <- 0;
    .SD[, {
        s <- value + s*0.9;
        s
      }, 
      1L:n]
  },  
  ID]

output:

   ID n  V1
1:  1 1 1.0
2:  1 2 2.9
3:  2 1 3.0
4:  2 2 6.7
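
For comparison, the grouped case can also be sketched in base R by restarting the running total within each ID. This is only an illustration under the same 0.9 discount factor; `dcs_one_group` is a hypothetical helper name, and a plain data.frame stands in for the data.table.

```r
df <- data.frame(ID = c(1, 1, 2, 2), value = 1:4)

# Discounted running total for one group's values; the accumulator starts at the first value.
dcs_one_group <- function(v, rate = 0.9) {
  Reduce(function(acc, x) rate * acc + x, v, accumulate = TRUE)
}

# ave() applies the helper separately within each ID and keeps the original row order.
df$DCS <- ave(as.numeric(df$value), df$ID, FUN = dcs_one_group)

print(df)
```

The DCS column should match the V1 column in the output above (1.0, 2.9, 3.0, 6.7).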
chinsoon12 answered Oct 19 '22 12:10


This is not a full answer, just a timing comparison of the other answers. Hopefully it helps in deciding which option to choose:

Load Libraries

library(data.table)
library(Rcpp)

Create Dataset

set.seed(0L)
dt <- data.table(value = rpois(1e4, 100))

Create necessary functions

app_3 <- function(dt) {
  m <- matrix(0, nrow = nrow(dt), ncol = nrow(dt))
  v <- 0.9**(seq(nrow(dt)) - 1)
  m[lower.tri(m, diag = TRUE)] <- unlist(sapply(rev(seq_along(v)), function(k) head(v, k)))

  dt[, DCS3 := m %*% value]
}

system.time(
cppFunction("
NumericVector dcs(NumericVector x, double disc) {
    int n = x.size();
    NumericVector res(n);
    res[0] = x[0];
    for (int i=1; i<n; i++) {
        res[i] += x[i] + res[i-1]*disc;
    }
    return res;
}"))
#   user  system elapsed 
#   0.03    0.16   20.03 

Benchmark

res <- bench::mark(time_unit="s",
  app_1 = dt[, DCS1 := sapply(1:.N, function(k) sum(0.9**(k - 1:k)*head(value, k)))],
  app_2 = dt[, DCS2 := dt[, Reduce(function(x, y) 0.9 * x + y, as.list(value), accumulate = TRUE)]],
  app_3 = app_3(dt),

  dt_rcpp = dt[, DCS4 := dcs(value, 0.9)],
  dt_recursive = {s <- 0
  dt[, DCS5 := {
    s <- value + s*0.9
    s
  }, 1L:nrow(dt)]
  },

  min_time = 1
)

res

timings:

# A tibble: 5 x 13
  expression                   min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result       memory      time   gc        
  <bch:expr>                 <dbl>    <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl> <list>       <list>      <list> <list>    
1 app_1                   6.34     6.34         0.158    1.12GB    0.315     1     2      6.34  <df[,7] [10~ <df[,3] [5~ <bch:~ <tibble [~
2 app_2                   0.0109   0.0123      71.3    612.34KB   21.8      72    22      1.01  <df[,7] [10~ <df[,3] [2~ <bch:~ <tibble [~
3 app_3                   3.93     3.93         0.255     4.1GB    0.764     1     3      3.93  <df[,7] [10~ <df[,3] [2~ <bch:~ <tibble [~
4 dt_rcpp                 0.000308 0.000337  2681.     195.46KB    6.01   2679     6      0.999 <df[,7] [10~ <df[,3] [2~ <bch:~ <tibble [~
5 dt_recursive            0.00939  0.00972     99.2    294.52KB    6.94    100     7      1.01  <df[,7] [10~ <df[,3] [3~ <bch:~ <tibble [~

Timings with 1e6 rows:

# A tibble: 3 x 13
  expression                  min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory       time   gc        
  <bch:expr>                <dbl>   <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl> <list>        <list>       <list> <list>    
1 app_2                   1.52    1.52        0.659    53.5MB    6.59      1    10       1.52 <df[,5] [1,0~ <df[,3] [27~ <bch:~ <tibble [~
2 dt_rcpp                 0.00731 0.00942    89.9      15.3MB    0.899   100     1       1.11 <df[,5] [1,0~ <df[,3] [20~ <bch:~ <tibble [~
3 dt_recursive            0.902   0.905       1.10     22.9MB    1.66      2     3       1.81 <df[,5] [1,0~ <df[,3] [4,~ <bch:~ <tibble [~

Created on 2020-05-15 by the reprex package (v0.3.0)

David answered Oct 19 '22 11:10


Maybe you can try the code below.


Approach 1

Apply the formula directly using sum

dt[,DCS:=sapply(1:.N,function(k) sum(0.9**(k-1:k)*head(value,k)))]

Approach 2

Use Reduce from base R

dt[,Reduce(function(x,y) 0.9*x+y,as.list(value),accumulate = TRUE)]

Approach 3

  • First, construct a matrix m that holds the convolution-like coefficients:

m <- matrix(0,nrow = nrow(dt),ncol = nrow(dt))
v <- 0.9**(seq(nrow(dt))-1)
m[lower.tri(m,diag = TRUE)] <- unlist(sapply(rev(seq_along(v)),function(k) head(v,k)))

    or use shift to obtain matrix m (thanks to @chinsoon12):

x <- 0L:(nrow(dt)-1L)
m <- do.call(cbind, shift(0.9^x, x, fill=0))

  • Then run:

dt[,DCS:=m%*%value]

Result

> dt
   year value     DCS
1:   79     5  5.0000
2:   80     2  6.5000
3:   81     6 11.8500
4:   82     8 18.6650
5:   83     9 25.7985
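
To show what the coefficient matrix in Approach 3 looks like, here is a small base R sketch that builds it with `outer` instead of `lower.tri` or `shift`. This is an alternative construction for illustration, not the code used above; `idx` and `dcs_matrix` are illustrative names.

```r
value <- c(5, 2, 6, 8, 9)
idx <- seq_along(value)

# Entry (i, j) is 0.9^(i - j) on and below the diagonal and 0 above it,
# so row i holds the discount weights applied to value[1..i].
m <- outer(idx, idx, function(i, j) ifelse(i >= j, 0.9^(i - j), 0))

# One matrix-vector product then yields every discounted cumulative sum at once.
dcs_matrix <- as.numeric(m %*% value)

print(round(m, 4))
print(dcs_matrix)
```

The matrix has one entry per pair of rows, so its memory use grows with the square of the number of rows. That makes this approach best suited to small inputs.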
ThomasIsCoding answered Oct 19 '22 11:10