I have a data frame with two vectors of length 5 and variable:
x <- seq(1:5)
y <- rep(0,5)
df <- data.frame(x, y)
z <- 10
I need to loop through the data frame and update y based on a condition related to x using z, and I need to update z at every iteration. Using a for loop, I would do this:
for (i in seq(2,nrow(df))){
if(df$x[i] %% 2 == 0){
df$y[i] <- df$y[i-1] + z
z <- z - df$x[i]
} else{
df$y[i] <- df$y[i-1]
}
}
Using data frames is slow and having to access the ith item using df$x[i] is not efficient, but I am unsure how to vectorize this since both y and z will change based on each iteration.
Does anyone have a recommendation on best way to iterate this? I was loking to avoide data frames completely and just use vectors so simplify the lookups, or use something from tidyverse using tibbles and the purrr package, but nothing seemed easy to implement. Thanks!
you can use sapply
function:
y=0
z=10
sapply(df$x,function(x)ifelse(x%%2==0,{y<<-y+z;z<<-z-x;y},y<<-y))
[1] 0 10 10 18 18
Here's a vectorized version
vec_fun <- function(x, z) {
L <- length(x)
vec_z <- rep(0, L)
I <- seq(2, L, by=2)
vec_z[I] <- head(z-c(0, cumsum(I)), length(I))
cumsum(vec_z)
}
The alternative versions - sapply
& tidyverse
sapply_fun <- function(x, z) {
y=0
sapply(df$x,function(x)ifelse(x%%2==0,{y<<-y+z;z<<-z-x;y},y<<-y))
}
library(tidyverse)
library(tidyverse)
tidy_fun <- function(df) {
df %>%
filter(x %% 2 != 0) %>%
mutate(z = accumulate(c(z, x[-1] - 1), `-`)) %>%
right_join(df, by = c("x", "y")) %>%
mutate(z = lag(z), z = ifelse(is.na(z), 0, z)) %>%
mutate(y = cumsum(z)) %>%
select(-z) %>%
pluck("y")
}
Your data
df <- data.frame(x=1:5, y=0)
z <- 10
Let's make sure they all return the same result
identical(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df))
# TRUE
Benchmark with small dataset - sapply_fun
appears to be slightly faster
library(microbenchmark)
microbenchmark(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df), times=100L, unit="relative")
# Unit: relative
# expr min lq mean median uq max neval
# vec_fun(df$x, z) 1.349053 1.316664 1.256691 1.359864 1.348181 1.146733 100
# sapply_fun(df$x, z) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
# tidy_fun(df) 411.409355 378.459005 168.689084 301.029545 270.519170 4.244833 100
Now with larger data.frame
df <- data.frame(x=1:1000, y=0)
z <- 10000
Same result - yes
identical(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df))
# TRUE
Benchmark with larger dataset - now it's obvious vec_fun
is faster
library(microbenchmark)
microbenchmark(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df), times=100L, unit="relative")
# Unit: relative
# expr min lq mean median uq max neval
# vec_fun(df$x, z) 1.00000 1.00000 1.00000 1.00000 1.00000 1.000 100
# sapply_fun(df$x, z) 42.69696 37.00708 32.19552 35.19225 27.82914 27.285 100
# tidy_fun(df) 259.87893 228.06417 201.43230 218.92552 172.45386 380.484 100
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With