How to write a cumulative calculation in data.table

A sequential, cumulative calculation

I need to make a time-series calculation, where the value calculated in each row depends on the result calculated in the previous row. I am hoping to use the convenience of data.table. The actual problem is a hydrological model -- a cumulative water balance calculation, adding rainfall at each time step and subtracting runoff and evaporation as a function of the current water volume. The dataset includes different basins and scenarios (groups). Here I will use a simpler illustration of the problem.

A simplified example of the calculation looks like this, for each time step (row) i:

 v[i] <- a[i] + b[i] * v[i-1]

a and b are vectors of parameter values, and v is the result vector. For the first row (i == 1) the initial value of v is taken as v0 = 0.

First attempt

My first thought was to use shift() in data.table. A minimal example, including the desired result v.ans, is

library(data.table)        # version 1.9.7
DT <- data.table(a = 1:4, 
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321) )
DT
#    a   b v.ans
# 1: 1 0.1 1.000
# 2: 2 0.1 2.100
# 3: 3 0.1 3.210
# 4: 4 0.1 4.321

DT[, v := NA]   # initialize v
DT[, v := a + b * ifelse(is.na(shift(v)), 0, shift(v))][]
#    a   b v.ans v
# 1: 1 0.1 1.000 1
# 2: 2 0.1 2.100 2
# 3: 3 0.1 3.210 3
# 4: 4 0.1 4.321 4

This doesn't work, because shift(v) gives a copy of the original column v, shifted by 1 row. It is unaffected by assignment to v.
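
A quick way to see this is the minimal sketch below. It uses a plain vector instead of a column (shift() also accepts vectors): the shifted values are computed once from whatever the vector holds at that moment and do not follow later changes.

```r
library(data.table)

v <- c(10, 20, 30, 40)
lagged <- shift(v)   # default is a lag of one position, padded with NA
lagged
# [1] NA 10 20 30

v[2] <- 99           # changing v afterwards ...
lagged               # ... leaves the shifted copy untouched
# [1] NA 10 20 30
```

So inside a single `:=` call every row sees the old v, never the value just computed for the row above.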

I also considered building the equation using cumsum() and cumprod(), but that won't work either.
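
For reference, unrolling the recurrence does give an expression built from cumprod() and cumsum(), but only under assumptions that make it unsuitable as a general solution. The sketch below uses a hypothetical helper name, vcalc_closed(). It assumes that every b is nonzero and that the previous value enters linearly, as in the simplified equation; the real water balance may not satisfy either.

```r
# Closed form of v[i] = a[i] + b[i] * v[i-1], obtained by unrolling the recurrence:
#   v[i] = P[i] * (v0 + sum_{j <= i} a[j] / P[j]),  with P[i] = b[1] * ... * b[i]
# Assumption: every b[i] is nonzero, because we divide by cumprod(b).
vcalc_closed <- function(a, b, v0 = 0) {
  b <- rep_len(b, length(a))   # recycle a scalar b to the length of a
  P <- cumprod(b)              # running product of b
  P * (v0 + cumsum(a / P))
}

v_closed <- vcalc_closed(1:4, 0.1)
all.equal(v_closed, c(1, 2.1, 3.21, 4.321))
# [1] TRUE
```

On a long series the running product can overflow or underflow (for b = 0.1 this happens after a few hundred rows), and the result then degrades to Inf or NaN, so this is not a practical replacement for the step-by-step calculation.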

Brute force approach

So I resort to a for loop inside a function for convenience:

vcalc <- function(a, b, v0 = 0) {
  v <- rep(NA_real_, length(a))   # initialize v as a numeric vector
  for (i in seq_along(a)) {       # seq_along() is safe for zero-length input
    v[i] <- a[i] + b[i] * ifelse(i==1, v0, v[i-1])
  }
  return(v)
}

This cumulative function works fine with data.table:

DT[, v := vcalc(a, b, 0)][]
#    a   b v.ans     v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
identical(DT$v, DT$v.ans)
# [1] TRUE
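
Because the loop lives in an ordinary function, it can also be applied per group with by, which is what the basins and scenarios need. This is a minimal sketch with made-up data: the column name basin and the helper name vcalc_loop() are assumptions for illustration, and the rows are assumed to be in time order within each group.

```r
library(data.table)

# Same recurrence as above, written as a plain loop (hypothetical helper name)
vcalc_loop <- function(a, b, v0 = 0) {
  v <- numeric(length(a))
  for (i in seq_along(a)) {
    v0 <- v[i] <- a[i] + b[i] * v0   # v0 carries the previous value forward
  }
  v
}

DTg <- data.table(basin = rep(c("A", "B"), each = 4),
                  a     = rep(1:4, times = 2),
                  b     = rep(c(0.1, 0.5), each = 4))

# The calculation restarts from v0 = 0 in every basin
DTg[, v := vcalc_loop(a, b, 0), by = basin]

DTg[basin == "B", v]
# [1] 1.000 2.500 4.250 6.125
```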

My question

Can I write this calculation in a more concise and efficient data.table way, without the for loop and/or the function definition? Using set(), perhaps?

Or is there a better approach altogether?
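
One loop-free way to spell the same recurrence in base R is Reduce() with accumulate = TRUE. The sketch below uses a hypothetical helper name, vcalc_reduce(). It is more compact, but it still evaluates one R function call per row, so I would not assume it is any faster than the explicit loop.

```r
vcalc_reduce <- function(a, b, v0 = 0) {
  b <- rep_len(b, length(a))   # recycle a scalar b to the length of a
  steps <- Reduce(function(v_prev, i) a[i] + b[i] * v_prev,
                  seq_along(a), init = v0, accumulate = TRUE)
  unlist(steps)[-1]            # drop the initial value v0 from the front
}

v_red <- vcalc_reduce(1:4, 0.1)
all.equal(v_red, c(1, 2.1, 3.21, 4.321))
# [1] TRUE
```

It can be used inside data.table in the same way as the loop version, for example DT[, v := vcalc_reduce(a, b, 0)].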

Edit: A better loop

David's Rcpp solution below inspired me to remove the ifelse() from the for loop:

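vcalc2 <- function(a, b, v0 = 0) {
  v <- rep(NA_real_, length(a))   # initialize v as a numeric vector
  for (i in seq_along(a)) {       # seq_along() is safe for zero-length input
    v0 <- v[i] <- a[i] + b[i] * v0   # v0 carries the previous value forward
  }
  return(v)
}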

vcalc2() is 60% faster than vcalc().


asked Nov 03 '16 22:11

Douglas Clark

1 Answer

This may not be exactly what you are looking for, since it is not a "data.table way" and it still uses a for loop. However, it should be faster (I assume the reason you want data.table is to speed up your code). I use Rcpp to write a short function called HydroFun that can be called from R like any other function (you just need to source it first). My gut feeling is that a pure data.table solution, if one exists, would be pretty complicated because you cannot compute a closed-form solution (but I may be wrong on this point...).

The Rcpp function looks like this (in the file hydrofun.cpp):

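#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector HydroFun(NumericVector a, NumericVector b, double v0 = 0.0) {
  // get the size of the vectors
  int vecSize = a.length();

  // a and b must line up element by element
  if (b.length() != vecSize) stop("a and b must have the same length");

  // initialize a numeric vector "v" (for the result)
  NumericVector v(vecSize);

  // nothing to compute for empty input (avoids reading a[0] out of bounds)
  if (vecSize == 0) return v;

  // compute the first value from the initial state v0
  v[0] = a[0] + b[0] * v0;

  // loop through the vector and compute the new value
  for (int i = 1; i < vecSize; ++i) {
    v[i] = a[i] + b[i] * v[i - 1];
  }
  return v;
}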

To source and use the function in R you can do:

Rcpp::sourceCpp("hydrofun.cpp")

library(data.table)
DT <- data.table(a = 1:4, 
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321))

DT[, v_ans2 := HydroFun(a, b, 0)]
DT
#    a   b v.ans v_ans2
# 1: 1 0.1 1.000  1.000
# 2: 2 0.1 2.100  2.100
# 3: 3 0.1 3.210  3.210
# 4: 4 0.1 4.321  4.321

This gives the result you are looking for (at least in terms of the values).

Comparing the speeds shows a speed-up of roughly 60x (a median of about 31.7 ms for vcalc() versus about 0.51 ms for HydroFun()).

library(microbenchmark)
n <- 10000
dt <- data.table(a = 1:n,
                 b = rnorm(n))

microbenchmark(dt[, v1 := vcalc(a, b, 0)],
               dt[, v2 := HydroFun(a, b, 0)])
# Unit: microseconds
#                               expr       min        lq       mean    median         uq       max neval
#     dt[, `:=`(v1, vcalc(a, b, 0))] 28369.672 30203.398 31883.9872 31651.566 32646.8780 68727.433   100
#  dt[, `:=`(v2, HydroFun(a, b, 0))]   381.307   421.697   512.2957   512.717   560.8585  1496.297   100

identical(dt$v1, dt$v2)
# [1] TRUE

Does that help you in any way?

answered Oct 06 '22 07:10

David


Tags: r, vectorization, data.table, time-series, difference-equations