
Run if loop in parallel

I have a data set with ~4 million rows that I need to loop over. The data contain repeated IDs: rows within an ID depend on each other, but rows are independent across IDs. For each ID, row [i+1] depends on row [i]. Here is a reproducible example. I realize the inner function in this example is not practical; it is simply a demonstration of the structure I have.

set.seed(123)

id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)

month = rep(seq(1,5),3)

x = round(rnorm(15,2,5))
y = rep(0,15)

df = as.data.frame(cbind(ids,month,x,y))

for (i in 1:nrow(df)){
  if(i>1 && df[i,1]==df[i-1,1]){
    #Main functions go here
    df[i,4] = df[i-1,4]^2+df[i,3]
  }
  else {
    df[i,4] = 1
  }
}

The issue is that, in reality, 1000 iterations of the real function take ~90 seconds, so 4 million rows would take days. It isn't feasible for me to run it this way. However, the IDs are independent and don't need to run together. My question is: is there a way to run this type of loop in parallel? A very inelegant solution would be to split the file into 50 sections without splitting an ID and simply run the same code on the 50 sub-files. I figure there should be a way to do this in code, though.
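The split-by-ID idea does not need separate files. As a rough sketch (not the real function), the data frame can be split into one piece per ID with split() and each piece handed to a worker with parallel::parLapply(). The helper name process_id, the two-worker cluster, and the hard-coded x values (the ones the seeded example produces) are assumptions for illustration.

```r
library(parallel)

# Toy data with the same structure as the example; x is hard-coded
# to the values that set.seed(123) produces there.
df <- data.frame(
  ids   = rep(1:3, each = 5),
  month = rep(1:5, times = 3),
  x     = c(-1, 1, 10, 2, 3, 11, 4, -4, -1, 0, 8, 4, 4, 3, -1),
  y     = 0
)

# Hypothetical helper: run the row-by-row recurrence on the rows of ONE id.
process_id <- function(d) {
  d$y[1] <- 1
  if (nrow(d) > 1) {
    for (i in 2:nrow(d)) {
      d$y[i] <- d$y[i - 1]^2 + d$x[i]
    }
  }
  d
}

# One data frame per id, so dependent rows always stay together.
chunks <- split(df, df$ids)

cl <- makeCluster(2)                      # worker count is an arbitrary choice
res <- parLapply(cl, chunks, process_id)  # each chunk is processed on some worker
stopCluster(cl)

# Stack the per-id results back into one data frame.
out <- do.call(rbind, res)
rownames(out) <- NULL
```

Because every row of an ID lands in the same chunk, the dependence between consecutive rows is preserved. The chunks come back in ID order, so the original row order is only kept if the data are already sorted by ID.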

EDIT: Added month column to show why the rows are dependent on each other. To address two comments below:

1) There are actually 6-7 lines of functions to run. Could I use ifelse() with multiple functions?
2) The desired output would be the full data frame. In reality there are more columns but I need each row in a data frame.

   ids month  x      y
1    1     1 -1      1
2    1     2  1      2
3    1     3 10     14
4    1     4  2    198
5    1     5  3  39207
6    2     1 11      1
7    2     2  4      5
8    2     3 -4     21
9    2     4 -1    440
10   2     5  0 193600
11   3     1  8      1
12   3     2  4      5
13   3     3  4     29
14   3     4  3    844
15   3     5 -1 712335

EDIT2: I've tried applying the foreach package, following another post, but it doesn't seem to work. The code below runs, but I think the issue is the way rows are distributed among cores: if each row is sent sequentially to a different core, rows from the same ID will never be processed on the same core.

library(foreach)
library(doParallel)


set.seed(123)

id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)

month = rep(seq(1,5),3)

x = round(rnorm(15,2,5))
y = rep(0,15)

df = as.data.frame(cbind(ids,month,x,y))

#setup parallel backend to use many processors
cores=detectCores()
cl <- makeCluster(cores[1]-1) #not to overload your computer
registerDoParallel(cl)

finalMatrix <- foreach(i=1:nrow(df), .combine=cbind) %dopar% {

  for (i in 1:nrow(df)){
    if(i>1 && df[i,1]==df[i-1,1]){
      #Main functions go here
      df[i,4] = df[i-1,4]^2+df[i,3]
    }
    else {
      df[i,4] = 1
    }
  }
}
#stop cluster
stopCluster(cl)
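The attempt above iterates over rows, and each task reruns the whole loop on its own copy of df; since a for loop evaluates to NULL, nothing useful comes back. One possible variant, sketched here with a two-worker cluster and hard-coded x values as illustrative assumptions, is to let foreach iterate over per-ID chunks instead, so each task owns all rows of one ID.

```r
library(foreach)
library(doParallel)

# Toy data; x is hard-coded to the values that set.seed(123) produces in the example.
df <- data.frame(
  ids   = rep(1:3, each = 5),
  month = rep(1:5, times = 3),
  x     = c(-1, 1, 10, 2, 3, 11, 4, -4, -1, 0, 8, 4, 4, 3, -1),
  y     = 0
)

cl <- makeCluster(2)   # worker count is an arbitrary choice
registerDoParallel(cl)

# Iterate over a list of per-id data frames rather than over row indices.
out <- foreach(d = split(df, df$ids), .combine = rbind) %dopar% {
  d$y[1] <- 1
  if (nrow(d) > 1) {
    for (i in 2:nrow(d)) {
      d$y[i] <- d$y[i - 1]^2 + d$x[i]
    }
  }
  d   # the last expression is the value returned for this id
}

stopCluster(cl)
```

With millions of rows and many small IDs, one task per ID may carry noticeable scheduling overhead; grouping several IDs into each chunk is a likely refinement, though that is not measured here.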
user137698 asked Apr 27 '26 06:04



1 Answer

You can simply recode your loop with Rcpp:

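#include <Rcpp.h>
using namespace Rcpp;

// Fills y for one group: y[0] = 1, then y[i] = y[i-1]^2 + x[i].
// [[Rcpp::export]]
NumericVector fill_y(const NumericVector& x) {

  int n = x.length();
  NumericVector y(n);
  if (n == 0) return y;  // empty group: nothing to fill
  y[0] = 1;
  for (int i = 1; i < n; i++) {
    y[i] = pow(y[i - 1], 2) + x[i];
  }
  return y;
}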

And, to apply it on each group, use dplyr:

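library(dplyr)

# fill_y() must be compiled first, e.g. with Rcpp::sourceCpp() on the C++ file above
df %>%
  group_by(ids) %>%
  mutate(y2 = fill_y(x))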

I think this should be fast enough that you don't need parallelism. In fact, I ran it on @Val's testdat and it took only 2 seconds (on an old computer).

Tell me if it's okay. Otherwise, I'll make a parallel version.
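For readers without a C++ toolchain, the same per-group pattern can be sketched in base R. This is only an illustration: fill_y_r is a hypothetical pure-R stand-in for fill_y(), and it will be slower than the compiled version, but it still works on a plain vector per group instead of indexing the data frame row by row.

```r
# Toy data; x is hard-coded to the values from the question's seeded example.
df <- data.frame(
  ids = rep(1:3, each = 5),
  x   = c(-1, 1, 10, 2, 3, 11, 4, -4, -1, 0, 8, 4, 4, 3, -1)
)

# Hypothetical pure-R stand-in for the Rcpp fill_y():
# y[1] = 1, then y[i] = y[i-1]^2 + x[i].
fill_y_r <- function(x) {
  n <- length(x)
  y <- numeric(n)
  if (n == 0) return(y)
  y[1] <- 1
  if (n > 1) {
    for (i in 2:n) {
      y[i] <- y[i - 1]^2 + x[i]
    }
  }
  y
}

# ave() applies FUN within each level of ids and returns the
# results aligned with the original row order.
df$y2 <- ave(df$x, df$ids, FUN = fill_y_r)
```

On the toy data this reproduces the y column from the question's expected output.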

F. Privé answered Apr 29 '26 20:04



