I am trying to generate values for over 7 variables across millions of observations, and the for loop I wrote to do it takes forever. Below is an example of what I am trying to achieve. In this case it runs fast, since there are only a few thousand observations:
# Load the tidyverse
library(tidyverse)
set.seed(50)
df <- tibble(SlNo = 1:2000,
             Scenario = rep(c(1, 2, 3, 4), 500),
             A = round(rnorm(2000, 11, 6)),
             B = round(rnorm(2000, 15, 4))) %>%
  arrange(Scenario)
# Split the data frame to add an initialisation row at the top of each scenario
df <- df %>% split(f = .$Scenario) %>%
  map_dfr(~ bind_rows(tibble(Scenario = 0), .x))
# The newly added rows get specific starting values for C and E
df <- df %>% mutate(C = if_else(Scenario != 0, 0, 4),
                    E = if_else(Scenario != 0, 0, 6))
for(i in 2:nrow(df)) {
  df$C[i] <- if_else(df$Scenario[i] != 0,
                     (1 - 0.5) * df$C[i-1] + 3 + 2 + df$B[i] + df$E[i-1],
                     df$C[i])
  df$E[i] <- if_else(df$Scenario[i] != 0, df$C[i] + df$B[i] - 50, df$E[i])
}
df
# A tibble: 2,004 x 6
   Scenario  SlNo     A     B     C      E
      <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
 1        0    NA    NA    NA  4     6
 2        1     1    14    19 32     1
 3        1     5     1    13 35    -2
 4        1     9    17    20 40.5  10.5
 5        1    13     8     7 42.8  -0.25
 6        1    17    10    16 42.1   8.12
 7        1    21     9    12 46.2   8.19
 8        1    25    14    18 54.3  22.3
 9        1    29    14    15 69.4  34.4
10        1    33     4    17 91.1  58.1
# ... with 1,994 more rows
I'd like to produce similar results quickly while working with larger data frames. I appreciate any help on this. Thank you in advance!!
There is a lot of overhead in loop processing because R needs to check the type of a variable nearly every time it looks at it. This makes it easy to change types and reuse variable names, but it slows down computation for very repetitive tasks, like performing an action in a loop.
Loops are slower in R than in C++ because R is an interpreted language rather than a compiled one. R (>= 3.4) does apply just-in-time (JIT) compilation, which makes loops faster, but still not C++ fast. That said, R loops are tolerable as long as the iteration count stays moderate (say, no more than 100,000 iterations).
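As a minimal illustration of that overhead (assuming the microbenchmark package is installed), compare an element-by-element loop against the equivalent vectorised call, which checks types once and iterates in C:
library(microbenchmark)

x <- rnorm(1e5)

# Element-by-element loop: R re-interprets the body and re-checks
# types on every single iteration
loop_sum <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total
}

microbenchmark(
  loop       = loop_sum(x),
  vectorised = sum(x),   # one dispatch; the iteration happens in C
  times      = 20
)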
In the tidyverse, you may use purrr::accumulate like this:
library(tidyverse)
set.seed(50)
df <- data.frame(SlNo = 1:2000,
                 Scenario = rep(c(1, 2, 3, 4), 500),
                 A = round(rnorm(2000, 11, 6)),
                 B = round(rnorm(2000, 15, 4))) %>%
  arrange(Scenario)

df %>%
  # nest B so each row's data is a one-row tibble accumulate can consume
  nest(data = B) %>%
  group_by(Scenario) %>%
  # carry the pair (C, E) forward row by row; .init supplies the starting
  # values C = 4, E = 6, and [-1] drops that seed from the result
  mutate(new = accumulate(data,
                          .init = tibble(C = 4, E = 6),
                          ~ tibble(C = (1 - 0.5) * .x$C + 5 + .y$B + .x$E,
                                   E = 0.5 * .x$C + 5 + .x$E + 2 * .y$B - 50))[-1]) %>%
  ungroup() %>%
  unnest_wider(data) %>%
  unnest_wider(new)
#> # A tibble: 2,000 x 6
#>     SlNo Scenario     A     B     C      E
#>    <int>    <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     1        1    14    19  32     1
#>  2     5        1     1    13  35    -2
#>  3     9        1    17    20  40.5  10.5
#>  4    13        1     8     7  42.8  -0.25
#>  5    17        1    10    16  42.1   8.12
#>  6    21        1     9    12  46.2   8.19
#>  7    25        1    14    18  54.3  22.3
#>  8    29        1    14    15  69.4  34.4
#>  9    33        1     4    17  91.1  58.1
#> 10    37        1    13    15 124.   88.7
#> # ... with 1,990 more rows
Created on 2021-07-05 by the reprex package (v2.0.0)
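The trick here is that accumulate() threads a running value through a sequence and returns every intermediate result; in the code above that running value is a one-row tibble holding the pair (C, E), and [-1] drops the .init seed. A minimal sketch of the mechanics:
library(purrr)

# .x is the running value, .y the next element; each result becomes
# .x for the following step
accumulate(1:5, ~ .x + .y)
#> [1]  1  3  6 10 15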
If you don't want to transition to data.table or dtplyr, where it could be tricky to adapt cumsum and lag to your needed output, you could run your loop in parallel. Because C[i] depends on C[i-1], the loop cannot be parallelised across rows, but the scenarios are independent of one another, so each one can be processed on its own worker. Here is an example of the code:
#install.packages("foreach")
#install.packages("doParallel")
# Loading libraries
library(foreach)
library(doParallel)
library(tidyverse)
set.seed(50)
df <- data_frame(SlNo = 1:2000,
Scenario = rep(c(1, 2, 3, 4),500),
A = round(rnorm(2000, 11, 6)),
B = round(rnorm(2000, 15, 4))) %>%
arrange(Scenario)
#splitting data-frame to add multiple rows in the data-frame
df<- df %>% split(f = .$Scenario) %>%
map_dfr(~bind_rows(tibble(Scenario = 0), .x))
#observations for certain variables in the newly added rows have specific values
df <- df %>% mutate(C = if_else(Scenario != 0, 0, 4),
E = if_else(Scenario != 0, 0, 6))
# Setting up the cores
n.cores <- parallel::detectCores() - 1
my.cluster <- parallel::makeCluster(
n.cores,
type = "PSOCK",
.packages="dplyr"
)
doParallel::registerDoParallel(cl = my.cluster)
# Run the foreach loop in parallel
foreach(
i = 2:nrow(df2),
.combine = 'rbind'
) %dopar% {
df$C[i] <- if_else(df$Scenario[i] != 0, (1-0.5) * df$C[i-1] + 3 + 2 + df$B[i] + df$E[i-1],
df$C[i])
df$E[i] <- if_else(df$Scenario[i] != 0, df$C[i] + df$B[i] - 50, df$E[i])
}
df
# stop the cluster
parallel::stopCluster(cl = my.cluster)
This should speed up your code significantly. Note, however, that the gains from parallel execution only become evident on larger datasets; on a small dataset the worker setup and communication overhead can actually make it take a bit more time to execute.
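To find that break-even point on your own data, you can time the two approaches directly with base R's system.time(). A minimal sketch, where run_serial() and run_parallel() are hypothetical wrappers around the sequential loop and the foreach version above:
# hypothetical wrappers around the two versions shown above
system.time(res_serial   <- run_serial(df))
system.time(res_parallel <- run_parallel(df))

# same results, different runtimes
all.equal(res_serial, res_parallel)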