I use dplyr for almost all (if not all) of my data handling, but I always struggle with one thing in R: recursive calculations.
Context: I have a sorted data frame storing items with an ID (thus a "group" notion) and some VALUEs. Some of the values are missing but can be calculated iteratively using a coefficient COEFF. I am looking for a simple and elegant way to do that (without a loop). Any clues?
Note: We assume that the first VALUE of each ID is never NA.
Below is a reproducible example with the expected solution:
df <- data.frame(ID = rep(letters[1:2], each = 5),
                 VALUE = c(1, 3, NA, NA, NA, 2, 2, 3, NA, NA),
                 COEFF = c(1, 2, 1, .5, 100, 1, 1, 1, 1, 1)
)
df_full <- df
# SOLUTION 1: Loop
for(i in 1:nrow(df_full))
{
  if(is.na(df_full$VALUE[i])){
    df_full$VALUE[i] <- df_full$VALUE[i-1]*df_full$COEFF[i]
  }
}
df_full
# ID VALUE COEFF
#1 a 1.0 1.0
#2 a 3.0 2.0
#3 a 3.0 1.0
#4 a 1.5 0.5
#5 a 150.0 100.0
#6 b 2.0 1.0
#7 b 2.0 1.0
#8 b 3.0 1.0
#9 b 3.0 1.0
#10 b 3.0 1.0
# PSEUDO-SOLUTION 2: using Reduce()
# I struggle to apply this approach for each "ID", like we could do in dplyr using dplyr::group_by()
# Example for the first ID:
Reduce(function(v, x) x*v, x = df$COEFF[3:5], init = df$VALUE[2], accumulate = TRUE)
# PSEUDO-SOLUTION 3: dplyr::lag()
# One might think lag() is enough to get the previous value, like so:
library(dplyr)
df %>%
  mutate(VALUE = ifelse(is.na(VALUE), lag(VALUE) * COEFF, VALUE))
# but lag() is not "refreshed" after each calculation: it takes a copy of the VALUE column at the beginning and shifts the indexes.
An iterative approach, applied per group:
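calc <- function(val, coef) {
  # seq_along(val)[-1] is empty for a length-1 group, unlike 2:length(val)
  for (i in seq_along(val)[-1]) {
    if (is.na(val[i])) {
      # fill from the previous (already filled) value
      val[i] <- val[i - 1] * coef[i]
    }
  }
  val
}
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(newval = calc(VALUE, COEFF))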
   ID    VALUE COEFF newval
   <chr> <dbl> <dbl>  <dbl>
 1 a         1   1      1
 2 a         3   2      3
 3 a        NA   1      3
 4 a        NA   0.5    1.5
 5 a        NA 100    150
 6 b         2   1      2
 7 b         2   1      2
 8 b         3   1      3
 9 b        NA   1      3
10 b        NA   1      3
group_by makes mutate work on each ID's subset of the columns. You can then process these vectors in an ordinary loop and return a result vector of the same length; mutate puts the per-group results back together.
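The same per-group pattern also works with the Reduce() idea from the question. The following is only a sketch: `fill_reduce` is a hypothetical helper name, and it assumes, as the question does, that the first VALUE of each ID is not NA. It folds over the positions of the vector and carries the previously filled value forward.

```r
library(dplyr)

df <- data.frame(
  ID    = rep(letters[1:2], each = 5),
  VALUE = c(1, 3, NA, NA, NA, 2, 2, 3, NA, NA),
  COEFF = c(1, 2, 1, .5, 100, 1, 1, 1, 1, 1)
)

# Hypothetical helper: fold over positions 2..n, keeping the last filled value.
fill_reduce <- function(val, coef) {
  if (length(val) < 2) return(val)
  Reduce(
    # prev is the filled value at position i - 1
    function(prev, i) if (is.na(val[i])) prev * coef[i] else val[i],
    x = seq_along(val)[-1],
    init = val[1],
    accumulate = TRUE
  )
}

# mutate() calls the helper once per ID, on that ID's rows only
df %>%
  group_by(ID) %>%
  mutate(VALUE = fill_reduce(VALUE, COEFF)) %>%
  ungroup()
```

This is still a sequential computation under the hood, so it is unlikely to be faster than the plain loop; it mainly avoids the explicit index bookkeeping.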
If you need speed, the for-loop can easily be accelerated with Rcpp:
library(Rcpp)
Rcpp::cppFunction('
NumericVector calc(NumericVector val, NumericVector coef) {
  // clone() so the input vector is not modified in place
  NumericVector out = clone(val);
  int n = out.size();
  for (int i = 1; i < n; i++) {
    if (R_IsNA(out[i])) {
      out[i] = out[i - 1] * coef[i];
    }
  }
  return out;
}')
I think you can probably get what you need here with a mix of tidyr::fill to fill NA values from above, combined with cumprod to get the cumulative effect of multiplying by the coefficient, and ifelse to choose when to use it. There's also a "working" column named V which is created and destroyed in the process.
library(dplyr)
df %>%
  mutate(V = tidyr::fill(df, VALUE)$VALUE) %>%
  group_by(ID) %>%
  mutate(VALUE = ifelse(is.na(VALUE),
                        V * cumprod(ifelse(is.na(VALUE), COEFF, 1)),
                        VALUE)) %>%
  select(-V)
#> # A tibble: 10 x 3
#> # Groups: ID [2]
#> ID VALUE COEFF
#> <fct> <dbl> <dbl>
#> 1 a 1 1
#> 2 a 3 2
#> 3 a 3 1
#> 4 a 1.5 0.5
#> 5 a 150 100
#> 6 b 2 1
#> 7 b 2 1
#> 8 b 3 1
#> 9 b 3 1
#> 10 b 3 1
Created on 2020-06-30 by the reprex package (v0.3.0)
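One caveat with the cumprod approach: the product runs over every NA row of an ID, so it matches the loop only when each ID has a single run of NAs, as in the example data. A possible variant, sketched below, restarts the product at every observed VALUE. The helper name `fill_runs` and the working column `RUN` are made up for illustration, and the sketch still assumes the first VALUE of each ID is not NA.

```r
library(dplyr)

# Hypothetical helper: restart the cumulative product at every observed VALUE.
fill_runs <- function(data) {
  data %>%
    group_by(ID) %>%
    # RUN increases by one at each non-NA VALUE, so each run starts with an observed value
    mutate(RUN = cumsum(!is.na(VALUE))) %>%
    group_by(ID, RUN) %>%
    # within a run: observed value times the coefficients of the NA rows so far
    mutate(VALUE = first(VALUE) * cumprod(ifelse(is.na(VALUE), COEFF, 1))) %>%
    ungroup() %>%
    select(-RUN)
}

# Two separate NA runs within one ID
df2 <- data.frame(ID = "a", VALUE = c(1, NA, 5, NA), COEFF = c(1, 2, 1, 3))
fill_runs(df2)$VALUE
#> [1]  1  2  5 15
```

On the original df this gives the same VALUE column as the loop in the question.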