Parallel for-loop in Windows

Tags: r

I run many SQL queries against an industrial-strength database far away, but it takes a long time to receive the results. Things were much faster when my computer running R was almost next to the database, which leads me to believe that the latency between my computer and the database is the bottleneck, and that running queries in parallel might speed things up. We are on different continents.

Here is a working version that is not in parallel:

doQueries <- function(filenameX, inp1, inp2) {
  print(paste("Starting:", inp1, inp2, "- saving to", filenameX, sep=" "))
  # The query goes here (using RODBC)
  # save(queryresults, file=filenameX)
}

input.rows <- cbind(c("file1.rda","file2.rda","file3.rda"),c("A","B","C"),c(12,13,14))

for (i in 1:nrow(input.rows)) {
  doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3])
}

I have tried the following code, but the foreach library does not seem to be available, and from what I read on CRAN, parallel is replacing earlier packages for parallelization ("package 'foreach' is not available (for R version 2.15.0)").

library(parallel)
library(foreach)
foreach (i=1:nrow(input.rows)) %dopar% {
  doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3])
}

How should I do this instead?

Thanks to all contributors on Stack Overflow!

/Chris

Update: Thanks to nograpes, I managed to load the libraries. The following code seems to work:

library(RODBC)
library(doParallel)
library(foreach)

# odbcCloseAll()
# my_conn <- odbcConnect("database", uid="xx", pwd="yy", case="nochange")

doQueries <- function(filenameX, inp1, inp2) {
  print(paste("Starting:", inp1, inp2, "- saving to", filenameX, sep=" "))
  # sql.test <- RODBC::sqlQuery(my_conn, "SELECT * FROM zzz LIMIT 100", rows_at_time=1024)
  # save(sql.test, file=filenameX)
}

input.rows <- cbind(c("file1.rda","file2.rda","file3.rda"),c("A","B","C"),c(12,13,14))

cl <- makeCluster(3)
registerDoParallel(cl)

foreach (i=1:nrow(input.rows)) %dopar% {
  doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3])
}

stopCluster(cl)

But when I include the actual SQL query, this error message appears: Error in { : task 1 failed - "first argument is not an open RODBC channel"

Could it be that this simply will not work conceptually, i.e. that RODBC cannot handle more than one query at a time?

I really appreciate all support.

/Chris

Update 2: Thanks a lot nograpes for the very good and impressive answers. It is difficult to judge whether the data transfers themselves are faster (total throughput is perhaps 20% higher), but since the queries (about 100) have different response times and need postprocessing (which I do in the function before saving), I get better combined utilization of the link and the local CPU. With just one query at a time, the CPUs are almost idle during the data transfer, and the link is quiet while the CPUs are working. With parallel queries, I see data arriving and the CPUs working at the same time. In total it became much faster. Thanks a lot!

/Chris



1 Answer

As I mentioned in my comment, this technique will probably not be faster. To answer your question, the foreach package is available for your version of R. Perhaps your selected repository hasn't been updated yet. Try this:

install.packages('foreach', repos='http://cran.us.r-project.org')

which should install the package. If that doesn't work, grab the binary for your OS here, and just install it through the menus.
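If you would rather stay inside R than click through the menus, install.packages can also install a downloaded binary directly. A minimal sketch, where the zip file name is just a placeholder for whatever you actually downloaded from CRAN:

# Install a package from a locally downloaded Windows binary.
# "foreach_1.4.0.zip" is a placeholder -- substitute the file you downloaded.
install.packages("foreach_1.4.0.zip", repos=NULL, type="win.binary")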

If the bottleneck is the network connection, then you can only speed up the process by reducing the amount of stuff you put on the network. One idea would be to remotely connect to the database server, have it dump the query to a file (on the server), compress it, and then download it to your computer, then have R uncompress and load it. It sounds like a lot, but you can probably do the entire process within R.
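As a rough sketch of that idea (the URL, file names, and the server-side dump step are all assumptions here, not something your setup already provides), the download-and-load half could be done from R like this:

# Assumes the database server has already dumped the query result to a
# compressed CSV at this hypothetical URL.
dump.url <- "http://dbserver.example.com/dumps/queryresult.csv.gz"
local.gz <- "queryresult.csv.gz"

# Fetch the compressed dump over the network (binary mode for Windows).
download.file(dump.url, destfile=local.gz, mode="wb")

# read.csv can read straight from a gzfile connection, so the file is
# decompressed on the fly while loading.
queryresults <- read.csv(gzfile(local.gz))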

Following up on your update, it appears that you did not include a .packages argument in your foreach statement. This is why you had to prefix the sqlQuery function with RODBC::. It is necessary to specify which packages the loop needs because, I think, it essentially starts a new R session for each node, and each session needs to be initialized with the packages. Similarly, you can't access my_conn because it was created outside of the loop; you need to create it inside the loop so that every node has its own copy.

library(RODBC)
library(foreach)
library(doParallel)
setwd('C:/Users/x/Desktop')

doQueries <- function(filenameX) {
  sql.text<-sqlQuery(my_conn, 'SELECT * FROM table;')
  save(sql.text, file=filenameX)
}

cl <- makeCluster(2)
registerDoParallel(cl)

foreach (i=1:2, .packages='RODBC') %dopar% {
  my_conn <- odbcConnect("db", uid="user", pwd="pass")
  doQueries(filenameX=paste('file_',i,sep=''))
}

But, like I mentioned, this probably won't be any faster.
