How to speed up iteration over a data frame with over 5 million observations in R?

I am trying to generate values for over 7 variables across millions of observations, and the for loop I wrote to achieve this takes forever. Below is an example of what I am trying to do; in this case it runs quickly, since there are only a few thousand observations:

# Load the tidyverse


library(tidyverse)
set.seed(50)

df <- tibble(SlNo = 1:2000,
             Scenario = rep(c(1, 2, 3, 4), 500),
             A = round(rnorm(2000, 11, 6)),
             B = round(rnorm(2000, 15, 4))) %>%
      arrange(Scenario)

# split the data frame by Scenario and prepend an extra row to each group

df <- df %>% split(f = .$Scenario) %>%
  map_dfr(~ bind_rows(tibble(Scenario = 0), .x))

# the newly added rows get fixed starting values for C and E

df <- df %>% mutate(C = if_else(Scenario != 0, 0, 4),
                    E = if_else(Scenario != 0, 0, 6))

for (i in 2:nrow(df)) {
  df$C[i] <- if_else(df$Scenario[i] != 0,
                     (1 - 0.5) * df$C[i-1] + 3 + 2 + df$B[i] + df$E[i-1],
                     df$C[i])
  df$E[i] <- if_else(df$Scenario[i] != 0, df$C[i] + df$B[i] - 50, df$E[i])
}

df

# A tibble: 2,004 x 6
   Scenario  SlNo     A     B     C      E
      <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
 1        0    NA    NA    NA   4     6   
 2        1     1    14    19  32     1   
 3        1     5     1    13  35    -2   
 4        1     9    17    20  40.5  10.5 
 5        1    13     8     7  42.8  -0.25
 6        1    17    10    16  42.1   8.12
 7        1    21     9    12  46.2   8.19
 8        1    25    14    18  54.3  22.3 
 9        1    29    14    15  69.4  34.4 
10        1    33     4    17  91.1  58.1 
# ... with 1,994 more rows

I'd like to produce similar results quickly while working with larger data frames. I appreciate any help on this. Thank you in advance!!

asked Feb 12 '19 by Dal

People also ask

Why is my R code taking so long?

There is a lot of overhead in the processing because R needs to check the type of a variable nearly every time it looks at it. This makes it easy to change types and reuse variable names, but slows down computation for very repetitive tasks, like performing an action in a loop.
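For illustration, here is a minimal sketch of that overhead: summing 10 million numbers element by element in an interpreted loop versus with the vectorized sum(), whose inner loop runs in compiled C code:

x <- runif(1e7)

# interpreted loop: R re-checks types and dispatches on every iteration
system.time({ s <- 0; for (v in x) s <- s + v })

# vectorized sum(): one call, the looping happens in compiled C code
system.time(sum(x))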

Why are for loops slow in R?

Loops are slower in R than in C++ because R is an interpreted language (not compiled), even though R (>= 3.4) now has just-in-time (JIT) compilation that makes R loops faster (yet still not as fast as compiled code). R loops are not that bad if you don't use too many iterations (say, no more than 100,000).
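To see the effect of byte compilation directly, you can compile a function yourself with the base compiler package. This is just a sketch; on R >= 3.4 you must disable the automatic JIT first, or the "interpreted" version gets compiled behind your back:

library(compiler)
enableJIT(0)   # disable automatic JIT so the comparison is visible

f  <- function(n) { s <- 0; for (i in seq_len(n)) s <- s + i; s }
fc <- cmpfun(f)                    # explicitly byte-compiled copy of f

system.time(f(1e7))                # interpreted loop
system.time(fc(1e7))               # byte-compiled loop, typically several times faster

enableJIT(3)   # restore the default JIT level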


2 Answers

In the tidyverse you can use purrr::accumulate() like this:

library(tidyverse)
set.seed(50)

df <- data.frame(SlNo = 1:2000,
                 Scenario = rep(c(1, 2, 3, 4), 500),
                 A = round(rnorm(2000, 11, 6)),
                 B = round(rnorm(2000, 15, 4))) %>%
  arrange(Scenario)

df %>%
  nest(data = B) %>%    # a one-row tibble per row, so accumulate() can step row by row
  group_by(Scenario) %>%
  mutate(new = accumulate(data,
                          .init = tibble(C = 4, E = 6),   # starting values of C and E
                          ~ tibble(C = (1 - 0.5) * .x$C + 5 + .y$B + .x$E,
                                   E = 0.5 * .x$C + 5 + .x$E + 2 * .y$B - 50)
                          )[-1]    # drop the .init element
         ) %>%
  ungroup() %>%
  unnest_wider(data) %>%
  unnest_wider(new)

#> # A tibble: 2,000 x 6
#>     SlNo Scenario     A     B     C     E
#>    <int>    <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1        1    14    19  32    1   
#>  2     5        1     1    13  35   -2   
#>  3     9        1    17    20  40.5 10.5 
#>  4    13        1     8     7  42.8 -0.25
#>  5    17        1    10    16  42.1  8.12
#>  6    21        1     9    12  46.2  8.19
#>  7    25        1    14    18  54.3 22.3 
#>  8    29        1    14    15  69.4 34.4 
#>  9    33        1     4    17  91.1 58.1 
#> 10    37        1    13    15 124.  88.7 
#> # ... with 1,990 more rows

Created on 2021-07-05 by the reprex package (v2.0.0)
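If the tibble-per-row bookkeeping above is still too slow at 5 million rows, note that most of the original loop's cost comes from indexing into the data frame and calling if_else() on every row, not from the arithmetic itself. Here is a sketch of the same recurrence run on plain numeric vectors once per group; run_scenario() is a made-up helper name, and it uses the df built at the top of this answer (without the extra Scenario = 0 rows):

# run the recurrence on bare vectors; start values C = 4, E = 6 as in the question
run_scenario <- function(b) {
  C <- numeric(length(b)); E <- numeric(length(b))
  prevC <- 4; prevE <- 6
  for (i in seq_along(b)) {
    C[i] <- 0.5 * prevC + 5 + b[i] + prevE
    E[i] <- C[i] + b[i] - 50
    prevC <- C[i]; prevE <- E[i]
  }
  tibble(C = C, E = E)
}

# unnamed data-frame results are spliced into columns by dplyr >= 1.0
df %>%
  group_by(Scenario) %>%
  mutate(run_scenario(B)) %>%
  ungroup()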

answered Sep 27 '22 by AnilGoyal


If you don't want to transition to data.table or dtplyr, where it could be tricky to figure out how to adapt cumsum() and lag() to the output you need, you could adapt your loop to run in parallel. Here is an example:

#install.packages("foreach")
#install.packages("doParallel")

# Loading libraries

library(foreach)
library(doParallel)
library(tidyverse)
set.seed(50)

df <- tibble(SlNo = 1:2000,
             Scenario = rep(c(1, 2, 3, 4), 500),
             A = round(rnorm(2000, 11, 6)),
             B = round(rnorm(2000, 15, 4))) %>%
      arrange(Scenario)

# split the data frame by Scenario and prepend an extra row to each group

df <- df %>% split(f = .$Scenario) %>%
  map_dfr(~ bind_rows(tibble(Scenario = 0), .x))

# the newly added rows get fixed starting values for C and E

df <- df %>% mutate(C = if_else(Scenario != 0, 0, 4),
                    E = if_else(Scenario != 0, 0, 6))


# Set up the cluster (.packages is a foreach argument, not a makeCluster one)
n.cores <- parallel::detectCores() - 1
my.cluster <- parallel::makeCluster(n.cores, type = "PSOCK")
doParallel::registerDoParallel(cl = my.cluster)

# Run the foreach loop in parallel
foreach(
        i = 2:nrow(df),
        .combine = 'rbind',
        .packages = "dplyr"
) %dopar% {
        df$C[i] <- if_else(df$Scenario[i] != 0,
                           (1 - 0.5) * df$C[i-1] + 3 + 2 + df$B[i] + df$E[i-1],
                           df$C[i])
        df$E[i] <- if_else(df$Scenario[i] != 0, df$C[i] + df$B[i] - 50, df$E[i])
}
df
# stop the cluster
parallel::stopCluster(cl = my.cluster)

This should speed up your code significantly. Note, however, that the execution-time gains from parallelism only become evident on larger datasets; on a small dataset the code can actually take more time to execute because of the cluster setup overhead.
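You can measure that fixed cost directly: spinning up a PSOCK cluster alone often takes on the order of a second on a typical machine, which is longer than the whole serial loop on a few thousand rows. A quick sketch:

library(parallel)

# time just the cluster start-up and shutdown, with no work in between
system.time({
  cl <- makeCluster(2, type = "PSOCK")
  stopCluster(cl)
})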

answered Sep 27 '22 by Fabio Favoretto