I am trying to generate values for over 7 variables across millions of observations, and the for loop I wrote to do it takes forever. Below is an example of what I am trying to achieve. In this case it runs fast, since there are only a few thousand observations:
# Load the tidyverse
library(tidyverse)
set.seed(50)
df <- tibble(SlNo = 1:2000,
             Scenario = rep(c(1, 2, 3, 4), 500),
             A = round(rnorm(2000, 11, 6)),
             B = round(rnorm(2000, 15, 4))) %>%
  arrange(Scenario)
# Split the data frame to add an initialisation row at the top of each scenario
df <- df %>% split(f = .$Scenario) %>%
  map_dfr(~ bind_rows(tibble(Scenario = 0), .x))
# The newly added rows get specific starting values for C and E
df <- df %>% mutate(C = if_else(Scenario != 0, 0, 4),
                    E = if_else(Scenario != 0, 0, 6))
for(i in 2:nrow(df)) {
  df$C[i] <- if_else(df$Scenario[i] != 0,
                     (1 - 0.5) * df$C[i-1] + 3 + 2 + df$B[i] + df$E[i-1],
                     df$C[i])
  df$E[i] <- if_else(df$Scenario[i] != 0, df$C[i] + df$B[i] - 50, df$E[i])
}
df
# A tibble: 2,004 x 6
   Scenario  SlNo     A     B     C      E
      <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
 1        0    NA    NA    NA  4     6
 2        1     1    14    19 32     1
 3        1     5     1    13 35    -2
 4        1     9    17    20 40.5  10.5
 5        1    13     8     7 42.8  -0.25
 6        1    17    10    16 42.1   8.12
 7        1    21     9    12 46.2   8.19
 8        1    25    14    18 54.3  22.3
 9        1    29    14    15 69.4  34.4
10        1    33     4    17 91.1  58.1
# ... with 1,994 more rows
I'd like to produce similar results quickly while working with larger data frames. I appreciate any help on this. Thank you in advance!!
There is a lot of overhead in loop processing because R needs to check the type of a variable nearly every time it looks at it. This makes it easy to change types and reuse variable names, but it slows down computation for very repetitive tasks, like performing an action in a loop.
Loops are slower in R than in C++ because R is an interpreted language rather than a compiled one. R (>= 3.4) does apply just-in-time (JIT) compilation, which makes loops faster, but still not C++ fast. That said, R loops are tolerable as long as the iteration count stays moderate (say, no more than 100,000 iterations).
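As a minimal illustration of that overhead (assuming the microbenchmark package is installed), compare an element-by-element loop against the equivalent vectorised call, which checks types once and iterates in C:
library(microbenchmark)

x <- rnorm(1e5)

# Element-by-element loop: R re-interprets the body and re-checks
# types on every single iteration
loop_sum <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total
}

microbenchmark(
  loop       = loop_sum(x),
  vectorised = sum(x),   # one dispatch; the iteration happens in C
  times      = 20
)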
In the tidyverse, you may use purrr::accumulate like this:
library(tidyverse)
set.seed(50)
df <- data.frame(SlNo = 1:2000,
                 Scenario = rep(c(1, 2, 3, 4), 500),
                 A = round(rnorm(2000, 11, 6)),
                 B = round(rnorm(2000, 15, 4))) %>%
  arrange(Scenario)

df %>%
  # nest B so each row's data is a one-row tibble accumulate can consume
  nest(data = B) %>%
  group_by(Scenario) %>%
  # carry the pair (C, E) forward row by row; .init supplies the starting
  # values C = 4, E = 6, and [-1] drops that seed from the result
  mutate(new = accumulate(data,
                          .init = tibble(C = 4, E = 6),
                          ~ tibble(C = (1 - 0.5) * .x$C + 5 + .y$B + .x$E,
                                   E = 0.5 * .x$C + 5 + .x$E + 2 * .y$B - 50))[-1]) %>%
  ungroup() %>%
  unnest_wider(data) %>%
  unnest_wider(new)
#> # A tibble: 2,000 x 6
#>     SlNo Scenario     A     B     C      E
#>    <int>    <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     1        1    14    19  32     1
#>  2     5        1     1    13  35    -2
#>  3     9        1    17    20  40.5  10.5
#>  4    13        1     8     7  42.8  -0.25
#>  5    17        1    10    16  42.1   8.12
#>  6    21        1     9    12  46.2   8.19
#>  7    25        1    14    18  54.3  22.3
#>  8    29        1    14    15  69.4  34.4
#>  9    33        1     4    17  91.1  58.1
#> 10    37        1    13    15 124.   88.7
#> # ... with 1,990 more rows
Created on 2021-07-05 by the reprex package (v2.0.0)
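The trick here is that accumulate() threads a running value through a sequence and returns every intermediate result; in the code above that running value is a one-row tibble holding the pair (C, E), and [-1] drops the .init seed. A minimal sketch of the mechanics:
library(purrr)

# .x is the running value, .y the next element; each result becomes
# .x for the following step
accumulate(1:5, ~ .x + .y)
#> [1]  1  3  6 10 15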
If you don't want to transition to data.table or dtplyr, where it could be tricky to adapt cumsum and lag to your needed output, you could run your loop in parallel. Because C[i] depends on C[i-1], the loop cannot be parallelised across rows, but the scenarios are independent of one another, so each one can be processed on its own worker. Here is an example of the code:
#install.packages("foreach")
#install.packages("doParallel")
# Loading libraries
library(foreach)
library(doParallel)
library(tidyverse)
set.seed(50)
df <- data_frame(SlNo = 1:2000,
Scenario = rep(c(1, 2, 3, 4),500),
A = round(rnorm(2000, 11, 6)),
B = round(rnorm(2000, 15, 4))) %>%
arrange(Scenario)
#splitting data-frame to add multiple rows in the data-frame
df<- df %>% split(f = .$Scenario) %>%
map_dfr(~bind_rows(tibble(Scenario = 0), .x))
#observations for certain variables in the newly added rows have specific values
df <- df %>% mutate(C = if_else(Scenario != 0, 0, 4),
E = if_else(Scenario != 0, 0, 6))
# Setting up the cores
n.cores <- parallel::detectCores() - 1
my.cluster <- parallel::makeCluster(
n.cores,
type = "PSOCK",
.packages="dplyr"
)
doParallel::registerDoParallel(cl = my.cluster)
# Run the foreach loop in parallel
foreach(
i = 2:nrow(df2),
.combine = 'rbind'
) %dopar% {
df$C[i] <- if_else(df$Scenario[i] != 0, (1-0.5) * df$C[i-1] + 3 + 2 + df$B[i] + df$E[i-1],
df$C[i])
df$E[i] <- if_else(df$Scenario[i] != 0, df$C[i] + df$B[i] - 50, df$E[i])
}
df
# stop the cluster
parallel::stopCluster(cl = my.cluster)
This should speed up your code significantly. Note, however, that the gains from parallel execution only become evident on larger datasets; on a small dataset the worker setup and communication overhead can actually make it take a bit more time to execute.
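To find that break-even point on your own data, you can time the two approaches directly with base R's system.time(). A minimal sketch, where run_serial() and run_parallel() are hypothetical wrappers around the sequential loop and the foreach version above:
# hypothetical wrappers around the two versions shown above
system.time(res_serial   <- run_serial(df))
system.time(res_parallel <- run_parallel(df))

# same results, different runtimes
all.equal(res_serial, res_parallel)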