I need to make a time-series calculation, where the value calculated in each row depends on the result calculated in the previous row. I am hoping to use the convenience of data.table
. The actual problem is a hydrological model -- a cumulative water balance calculation, adding rainfall at each time step and subtracting runoff and evaporation as a function of the current water volume. The dataset includes different basins and scenarios (groups). Here I will use a simpler illustration of the problem.
A simplified example of the calculation looks like this, for each time step (row) i
:
v[i] <- a[i] + b[i] * v[i-1]
a
and b
are vectors of parameter values, and v
is the result vector. For the first row (i == 1
) the initial value of v
is taken as v0 = 0
.
My first thought was to use shift()
in data.table
. A minimal example, including the desired result v.ans
, is
library(data.table) # version 1.9.7
DT <- data.table(a = 1:4,
b = 0.1,
v.ans = c(1, 2.1, 3.21, 4.321) )
DT
# a b v.ans
# 1: 1 0.1 1.000
# 2: 2 0.1 2.100
# 3: 3 0.1 3.210
# 4: 4 0.1 4.321
DT[, v := NA] # initialize v
DT[, v := a + b * ifelse(is.na(shift(v)), 0, shift(v))][]
# a b v.ans v
# 1: 1 0.1 1.000 1
# 2: 2 0.1 2.100 2
# 3: 3 0.1 3.210 3
# 4: 4 0.1 4.321 4
This doesn't work, because shift(v)
gives a copy of the original column v
, shifted by 1 row. It is unaffected by assignment to v
.
I also considered building the equation using cumsum() and cumprod(), but that won't work either.
So I resort to a for loop inside a function for convenience:
vcalc <- function(a, b, v0 = 0) {
v <- rep(NA, length(a)) # initialize v
for (i in 1:length(a)) {
v[i] <- a[i] + b[i] * ifelse(i==1, v0, v[i-1])
}
return(v)
}
This cumulative function works fine with data.table:
DT[, v := vcalc(a, b, 0)][]
# a b v.ans v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
identical(DT$v, DT$v.ans)
# [1] TRUE
My question is, can I write this calculation in a more concise and efficient data.table
way, without having to use the for loop and/or function definition? Using set()
perhaps?
Or is there a better approach all together?
David's Rcpp solution below inspired me to remove the ifelse()
from the for
loop:
vcalc2 <- function(a, b, v0 = 0) {
v <- rep(NA, length(a))
for (i in 1:length(a)) {
v0 <- v[i] <- a[i] + b[i] * v0
}
return(v)
}
vcalc2()
is 60% faster than vcalc()
.
The cumulative frequency is calculated by adding each frequency from a frequency distribution table to the sum of its predecessors. The last value will always be equal to the total for all observations, since all frequencies will already have been added to the previous total.
The definition of the cumulative sum is the sum of a given sequence that is increasing or getting bigger with more additions. The real example of a cumulative sum is the increasing amount of water in a swing pool.
A running total in Excel (also known as cumulative sum) refers to the partial sum of a data set. It is a summation of a sequence of numbers that is refreshed every time a new number is added to the sequence.
It may not be 100% what you are looking for, as it does not use the "data.table-way" and still uses a for-loop. However, this approach should be faster (I assume you want to use data.table and the data.table-way to speed up your code). I leverage Rcpp to write a short function called HydroFun
, that can be used in R like any other function (you just need to source the function first). My gut-feeling tells me that the data.table way (if existent) is pretty complicated because you cannot compute a closed-form solution (but I may be wrong on this point...).
My approach looks like this:
The Rcpp function looks like this (in the file: hydrofun.cpp
):
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector HydroFun(NumericVector a, NumericVector b, double v0 = 0.0) {
// get the size of the vectors
int vecSize = a.length();
// initialize a numeric vector "v" (for the result)
NumericVector v(vecSize);
// compute v_0
v[0] = a[0] + b[0] * v0;
// loop through the vector and compute the new value
for (int i = 1; i < vecSize; ++i) {
v[i] = a[i] + b[i] * v[i - 1];
}
return v;
}
To source and use the function in R you can do:
Rcpp::sourceCpp("hydrofun.cpp")
library(data.table)
DT <- data.table(a = 1:4,
b = 0.1,
v.ans = c(1, 2.1, 3.21, 4.321))
DT[, v_ans2 := HydroFun(a, b, 0)]
DT
# a b v.ans v_ans2
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
Which gives the result you are looking for (at least from the value-perspective).
Comparing the speeds reveals a speed-up of roughly 65x.
library(microbenchmark)
n <- 10000
dt <- data.table(a = 1:n,
b = rnorm(n))
microbenchmark(dt[, v1 := vcalc(a, b, 0)],
dt[, v2 := HydroFun(a, b, 0)])
# Unit: microseconds
# expr min lq mean median uq max neval
# dt[, `:=`(v1, vcalc(a, b, 0))] 28369.672 30203.398 31883.9872 31651.566 32646.8780 68727.433 100
# dt[, `:=`(v2, HydroFun(a, b, 0))] 381.307 421.697 512.2957 512.717 560.8585 1496.297 100
identical(dt$v1, dt$v2)
# [1] TRUE
Does that help you in any way?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With