Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

iterate through data frame where each iteration is dependent on the previous item in R efficiently

I have a data frame with two vectors of length 5 and variable:

x <- seq(1:5)
y <- rep(0,5)
df <- data.frame(x, y)
z <- 10

I need to loop through the data frame and update y based on a condition related to x using z, and I need to update z at every iteration. Using a for loop, I would do this:

for (i in seq(2,nrow(df))){
  if(df$x[i] %% 2 == 0){
    df$y[i] <- df$y[i-1] + z
    z <- z - df$x[i]
  } else{
    df$y[i] <- df$y[i-1]
  }
}

Using data frames is slow and having to access the ith item using df$x[i] is not efficient, but I am unsure how to vectorize this since both y and z will change based on each iteration.

Does anyone have a recommendation on best way to iterate this? I was loking to avoide data frames completely and just use vectors so simplify the lookups, or use something from tidyverse using tibbles and the purrr package, but nothing seemed easy to implement. Thanks!

like image 822
George Avatar asked Feb 08 '18 22:02

George


2 Answers

you can use sapply function:

y=0
z=10
sapply(df$x,function(x)ifelse(x%%2==0,{y<<-y+z;z<<-z-x;y},y<<-y))
[1]  0 10 10 18 18
like image 132
KU99 Avatar answered Oct 07 '22 02:10

KU99


Here's a vectorized version

vec_fun <- function(x, z) {
    L <- length(x)

    vec_z <- rep(0, L)
    I <- seq(2, L, by=2)
    vec_z[I] <- head(z-c(0, cumsum(I)), length(I))

    cumsum(vec_z)
}

The alternative versions - sapply & tidyverse

sapply_fun <- function(x, z) {
    y=0
    sapply(df$x,function(x)ifelse(x%%2==0,{y<<-y+z;z<<-z-x;y},y<<-y))
}

library(tidyverse)
library(tidyverse)
tidy_fun <- function(df) {
    df %>% 
      filter(x %% 2 != 0) %>%
      mutate(z = accumulate(c(z, x[-1] - 1), `-`)) %>%
      right_join(df, by = c("x", "y")) %>%
      mutate(z = lag(z), z = ifelse(is.na(z), 0, z)) %>%
      mutate(y = cumsum(z)) %>%
      select(-z) %>%
      pluck("y")
}

Your data

df <- data.frame(x=1:5, y=0)
z <- 10

Let's make sure they all return the same result

identical(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df))
# TRUE

Benchmark with small dataset - sapply_fun appears to be slightly faster

library(microbenchmark)
microbenchmark(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df), times=100L, unit="relative")

# Unit: relative
                # expr        min         lq       mean     median         uq      max neval
    # vec_fun(df$x, z)   1.349053   1.316664   1.256691   1.359864   1.348181 1.146733   100
 # sapply_fun(df$x, z)   1.000000   1.000000   1.000000   1.000000   1.000000 1.000000   100
        # tidy_fun(df) 411.409355 378.459005 168.689084 301.029545 270.519170 4.244833   100

Now with larger data.frame

df <- data.frame(x=1:1000, y=0)
z <- 10000

Same result - yes

identical(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df))
# TRUE

Benchmark with larger dataset - now it's obvious vec_fun is faster

library(microbenchmark)
microbenchmark(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df), times=100L, unit="relative")

# Unit: relative
                # expr       min        lq      mean    median        uq     max neval
    # vec_fun(df$x, z)   1.00000   1.00000   1.00000   1.00000   1.00000   1.000   100
 # sapply_fun(df$x, z)  42.69696  37.00708  32.19552  35.19225  27.82914  27.285   100
        # tidy_fun(df) 259.87893 228.06417 201.43230 218.92552 172.45386 380.484   100
like image 39
CPak Avatar answered Oct 07 '22 02:10

CPak