
Parallel processing in R for a Data Frame

I have a data frame like this:

                      Open   High    Low  Close Volume
1998-09-08 10:32:00 106.44 106.44 106.44 106.44      1
1998-09-08 10:33:00 106.42 106.42 106.35 106.35 628225
1998-09-08 10:34:00 106.31 106.38 106.31 106.38 135840
1998-09-08 10:35:00 106.35 106.35 106.32 106.34 170010
1998-09-08 10:36:00 106.35 106.36 106.35 106.36 309560
1998-09-08 10:37:00 106.44 106.50 106.44 106.50 115540
1998-09-08 10:38:00 106.49 106.53 106.49 106.52 427620
1998-09-08 10:39:00 106.53 106.54 106.52 106.53 321350
1998-09-08 10:40:00 106.55 106.60 106.54 106.54 317647
1998-09-08 10:41:00 106.56 106.63 106.56 106.63 233901

I need to change the Open column using parallel processing. I wrote a function like this:

parTest <- function(x) {
    foreach(i = 1:nrow(x)) %dopar% {
        x[i,1] <- i
    }
    return(x)
}

But when I call this function nothing changes, and it returns the data frame unchanged.

zz <- parTest (x)
zz 

When I use a simple for loop it works, but foreach does not!

I also loaded the appropriate packages and set up the cores:

library(foreach)
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)

Thanks for your help.

Yaser Abolghasemi asked Jan 08 '23 12:01



1 Answer

foreach collects the value of each iteration's code block and combines those values. Since you do not specify the .combine argument, it returns them as a list, one element per iteration. (The first paragraph of help(foreach) says this.)
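As a small illustration of that default, here is a sketch that uses the sequential %do% operator so it runs without a cluster; the shape of the result is the same as with %dopar%. The toy expression i * 2 is only a placeholder.

```r
library(foreach)

# Without .combine, foreach returns a list with one element per iteration.
as_list <- foreach(i = 1:3) %do% {
  i * 2
}
str(as_list)

# With .combine = c, the per-iteration values are combined into a vector.
as_vector <- foreach(i = 1:3, .combine = c) %do% {
  i * 2
}
as_vector
```

The first call gives a list of three numbers, the second a plain numeric vector holding the same values.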

Okay, so what happens in each run of your code block? It works on a copy of the data.frame as it was when the call started (meaning the iteration for row 2 does not see the change made for row 1, etc.), updates that copy, and then returns "something".

This "something" is not what you think it should be. To see this, try manually updating the data.frame with something like (x[1,1] <- 1); the surrounding parentheses print the value of the assignment, which is "1", not the contents of x. In other words, the return value from an assignment is the value assigned, not the whole variable to which it was assigned.
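A minimal sketch of that check, using a two-row stand-in for the data frame (the values are copied from the question; the variable name assigned is just for illustration):

```r
# Two-row stand-in for the data frame in the question.
x <- data.frame(Open = c(106.44, 106.42), Close = c(106.44, 106.35))

# Capture the value of the assignment expression itself.
assigned <- (x[1, 1] <- 1)

assigned                  # the value that was assigned
is.data.frame(assigned)   # FALSE: it is not the updated data frame
x[1, 1]                   # x itself was updated in this (non-parallel) session
```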

So, in your case, x[i,1] <- i silently returns i, so the value returned by foreach (which you are not capturing) is a list of the values 1:nrow(x), useless to you. If you assigned the result of foreach to a variable and returned that from your function, you would see this.
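To make this visible, here is a sketch that captures what foreach returns and hands it back together with x. The function name parTestCaptured, the two-worker cluster, and the three-row data frame are assumptions for illustration only.

```r
library(foreach)
library(doParallel)

# Two workers is an arbitrary choice for this sketch.
cl <- makeCluster(2)
registerDoParallel(cl)

x <- data.frame(Open = c(106.44, 106.42, 106.31))

# Hypothetical variant of the question's function that keeps the foreach result.
parTestCaptured <- function(x) {
  res <- foreach(i = seq_len(nrow(x))) %dopar% {
    x[i, 1] <- i   # modifies the worker's copy; the block's value is i
  }
  list(result = res, x = x)
}

out <- parTestCaptured(x)
stopCluster(cl)

str(out$result)   # a list holding 1, 2, 3
out$x$Open        # still the original values
```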

What I think you want is for the code block to return the specific row that has been adjusted, and then combine them into a data.frame at the end. Note, if you return the whole data.frame, then the return from foreach will be a list of data.frames, not (I think) what you want.

There are many ways to do this; I'll show three. The first works just fine, and it's the most literal about how you are managing the data.frame.

parTest <- function(x) {
    ret <- foreach(i = 1:nrow(x)) %dopar% {
        x[i,1] <- i
        x[i,,drop=FALSE]
    }
    do.call('rbind', ret)
}

You can simplify this a little by letting foreach do the combining with .combine=rbind. Note that if your data.frame is rather large, either version makes a lot of copies of it for the workers. If each iteration only needs one row (I'm assuming your example is contrived as a simple MWE), then that copying is unnecessary.

parTest <- function(x) {
    foreach(i = 1:nrow(x), .combine=rbind) %dopar% {
        x[i,1] <- i
        x[i,,drop=FALSE]
    }
}
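For completeness, a sketch of running this version end to end. The two-worker cluster and the three-row data frame (values taken from the question) are assumptions; adjust both to your setup.

```r
library(foreach)
library(doParallel)

# Two workers is an arbitrary choice for this sketch.
cl <- makeCluster(2)
registerDoParallel(cl)

# Three-row stand-in for the data frame in the question.
x <- data.frame(
  Open  = c(106.44, 106.42, 106.31),
  Close = c(106.44, 106.35, 106.38)
)

parTest <- function(x) {
  foreach(i = seq_len(nrow(x)), .combine = rbind) %dopar% {
    x[i, 1] <- i           # update the worker's copy
    x[i, , drop = FALSE]   # return only the adjusted row
  }
}

zz <- parTest(x)   # capture the combined result
stopCluster(cl)    # release the workers when done

zz$Open   # the rewritten column
x$Open    # the original is untouched
```

The key point is that the new values live in zz; if you want x itself changed, assign the result back with x <- parTest(x).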

Another technique, using the iterators package:

library(iterators)
parTest <- function(x) {
    # df is one row of x per iteration; i runs alongside it as the row number
    foreach(df = iter(x, by='row'), i = seq_len(nrow(x)), .combine=rbind) %dopar% {
        df[,1] <- i
        df
    }
}

This last technique seems to me to be a little more readable. And, if you really only care about a single row at a time, it may perform faster than the others.
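To see what the row iterator hands to each iteration, here is a small sketch that steps through it by hand with nextElem(). The three-row data frame is a stand-in with values from the question.

```r
library(iterators)

x <- data.frame(
  Open  = c(106.44, 106.42, 106.31),
  Close = c(106.44, 106.35, 106.38)
)

# Each call to nextElem() yields the next row as a one-row data frame.
it <- iter(x, by = "row")
first  <- nextElem(it)
second <- nextElem(it)

first
second
```

Because each iteration receives only its own row, the code block does not need to index into the full data frame.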

BTW: I'm assuming that you are really looking for the resulting data.frame, not specifically for the side-effect of changing the data.frame in the current environment. When dealing with parallel stuff using %dopar%, realize that the child processes work on copies and do not get to see or change the actual calling environment.

r2evans answered Jan 18 '23 01:01
