I have a very large time series and I need to create a different time series based on some arbitrary value in the beginning and the change in the current time period. In the real dataset this change depends on other variables of the data frame, but for the purpose of an MWE I recreate it as follows:
initial_value <- 100
set.seed(123)
library(data.table)
df <- data.table(num = 1:10, change = rnorm(10))
The new variable value is defined as its own value in the previous period plus the change in the current period. The value of the first observation is determined by an arbitrarily chosen initial_value. If there were no restrictions on value, it could be created simply as
df <- df[, value0 := initial_value + cumsum(change)]
This is very fast using data.table. Unfortunately, however, change may also depend on the actual value in the previous period. Specifically, let's assume that whenever value reaches 102, the series needs to return to the initial_value in the next period and stay there for 3 periods. Thus, in the following data frame, I need to create the variable value, while the code above produced value0:
num change value0 value
1: 1 -0.56047565 99.43952 99.43952
2: 2 -0.23017749 99.20935 99.20935
3: 3 1.55870831 100.76806 100.76806
4: 4 0.07050839 100.83856 100.83856
5: 5 0.12928774 100.96785 100.96785
6: 6 1.71506499 102.68292 102.68292
7: 7 0.46091621 103.14383 100.00000
8: 8 -1.26506123 101.87877 100.00000
9: 9 -0.68685285 101.19192 100.00000
10: 10 -0.44566197 100.74626 99.55434
So far the only way I managed to produce this result is using a loop:
df$value <- NA
df$value[1] <- initial_value + df$change[1]
for (i in 2:nrow(df)) {
  if (is.na(df$value[i])) {
    if (df$value[i-1] < 102) {
      df$value[i] <- df$value[i-1] + df$change[i]
    } else {
      # clamp so the 3-period reset window does not run past the last row
      df$value[i:min(i+2, nrow(df))] <- initial_value
    }
  }
}
However, looping over (dozens of) millions of observations is extremely slow. Is there a way to possibly vectorize it or simply run the process more efficiently?
I suggest using Rcpp for simple loops like this one; it is easy to replicate the requested logic.
Your original function:
fun_r <- function(){
  df$value <- NA
  df$value[1] <- initial_value + df$change[1]
  for (i in 2:nrow(df)) {
    if (is.na(df$value[i])) {
      if (df$value[i-1] < 102) {
        df$value[i] <- df$value[i-1] + df$change[i]
      } else {
        # clamp so the reset window stays inside the data frame
        df$value[i:min(i+2, nrow(df))] <- initial_value
      }
    }
  }
  df
}
The same function in C++:
library(Rcpp)
cppFunction('
NumericVector fun_c(NumericVector change, double init, double thr){
  int n = change.size();
  NumericVector out(n);
  out[0] = init + change[0];
  for(int i = 1; i < n; i++){
    if( out[i - 1] < thr ){
      out[i] = out[i - 1] + change[i];
    } else {
      // fill the next 3 periods with init, clamped at the last index
      int end = std::min(i + 2, n - 1);
      for(int j = i; j <= end; j++) out[j] = init;
      i = end; // skip past the filled window
    }
  }
  return out;
}
')
UPDATE:
The R function written first (above) relies on data.frame subsetting, which is a highly inefficient way to handle data in R. That function is simply an underdog, expected to lose every benchmark. When looping, one should always vectorize computations (work on plain vectors and matrices). Below is a function that is more competitive with the Rcpp example:
fun_r2 <- function(change, initial_value, thr){
  n <- length(change)
  value <- numeric(n)
  value[1] <- initial_value + change[1]
  for (i in 2:n) {
    if (value[i] == 0) {  # not yet filled by a reset window
      if (value[i-1] < thr) {
        value[i] <- value[i-1] + change[i]
      } else {
        # clamp so the reset window stays inside the vector
        value[i:min(i+2, n)] <- initial_value
      }
    }
  }
  value
}
All three functions produce the same results. fun_c is the fastest, but the vectorized fun_r2 is acceptably close.
df$value <- fun_r()$value
df$value_r2 <- fun_r2(df$change, initial_value=100, thr=102)
df$value_rcpp <- fun_c(df$change, init=100, thr=102)
all.equal(df$value, df$value_rcpp)
# TRUE
all.equal(df$value, df$value_r2)
# TRUE
mb <- microbenchmark::microbenchmark(
  fun_r(),
  fun_r2(df$change, initial_value=100, thr=102),
  fun_c(df$change, init=100, thr=102),
  times=100L
)
# expr mean
# 1 fun_r() 6650.72481
# 2 fun_r2() 42.28442
# 3 fun_c() 18.24121
Enjoy!