I run many SQL queries against an industrial-strength database far away, but it takes a long time to receive the results. It was much faster when my computer running R was almost next to the database, which leads me to believe that the latency between my computer and the database is the bottleneck, and that running parallel queries might speed things up. We are on different continents.
Here is a working, non-parallel version:
doQueries <- function(filenameX, inp1, inp2) {
  print(paste("Starting:", inp1, inp2, ", saving to", filenameX))
  # Here the query should go (using RODBC)
  # save(queryresults, file=filenameX)
}
input.rows <- cbind(c("file1.rda","file2.rda","file3.rda"), c("A","B","C"), c(12,13,14))
for (i in 1:nrow(input.rows)) {
  doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3])
}
I have tried the following code, but the foreach library does not seem to be available, and as I understand from reading on CRAN, parallel is replacing earlier packages for parallelization ("package ‘foreach’ is not available (for R version 2.15.0)").
library(parallel)
library(foreach)
foreach (i=1:nrow(input.rows)) %dopar% {
  doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3])
}
How should I do this instead?
Thanks to all contributors on Stackoverflow!
/Chris
Update: Thanks to nograpes, I managed to load the libraries. The following code seems to work:
library(RODBC)
library(doParallel)
library(foreach)
# odbcCloseAll()
# my_conn <- odbcConnect("database", uid="xx", pwd="yy", case="nochange")
doQueries <- function(filenameX, inp1, inp2) {
  print(paste("Starting:", inp1, inp2, ", saving to", filenameX))
  # sql.test <- RODBC::sqlQuery(my_conn, "SELECT * FROM zzz LIMIT 100", rows_at_time=1024)
  # save(sql.test, file=filenameX)
}
input.rows <- cbind(c("file1.rda","file2.rda","file3.rda"), c("A","B","C"), c(12,13,14))
cl <- makeCluster(3)
registerDoParallel(cl)
foreach (i=1:nrow(input.rows)) %dopar% {
  doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3])
}
stopCluster(cl)
But when I include the actual SQL query, this error message appears: Error in { : task 1 failed - "first argument is not an open RODBC channel"
Could it be that this conceptually will not work, i.e. that RODBC cannot handle more than one query at a time?
I really appreciate all support.
/Chris
Update 2: Thanks a lot nograpes for the very good and impressive answers. It is difficult to judge whether the data transfers themselves are faster (total throughput seems about 20% higher), but since the roughly 100 queries have different response times and need post-processing (which I include in the function before saving), I get better combined utilization of the link and the local CPU. With just one query at a time, the CPUs are almost idle during the data transfer, and then the link is quiet while the CPUs are working. With parallel queries, I see data arriving and the CPUs working at the same time. In total it became much faster. Thanks a lot!
/Chris
As I mentioned in my comment, this technique will probably not be faster. To answer your question, the foreach package is available for your version of R. Perhaps your selected repository hasn't been updated yet. Try this:
install.packages('foreach', repos='http://cran.us.r-project.org')
which should install the package. If that doesn't work, grab the binary for your OS here, and just install it through the menus.
If the bottleneck is the network connection, then you can only speed up the process by reducing the amount of stuff you put on the network. One idea would be to remotely connect to the database server, have it dump the query to a file (on the server), compress it, and then download it to your computer, then have R uncompress and load it. It sounds like a lot, but you can probably do the entire process within R.
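To illustrate the R end of that idea: gzfile() decompresses transparently, so read.csv() can load a gzip-compressed dump directly, and only the compressed bytes need to cross the slow link. This is just a local sketch (the file name is a placeholder, and the server-side dump step is faked locally so the example is self-contained):

```r
# Sketch of the client side of the compress-then-transfer idea.
# In reality the server would write and gzip the query result;
# here we fake that step locally so the example runs anywhere.
df <- data.frame(id = 1:3, val = c("A", "B", "C"))
gz <- gzfile("dump.csv.gz", "w")
write.csv(df, gz, row.names = FALSE)
close(gz)

# ...after transferring dump.csv.gz across the network...
# gzfile() decompresses on the fly, so read.csv() loads it directly.
df2 <- read.csv(gzfile("dump.csv.gz"))
```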
Following up on your update, it appears that you did not include a .packages argument in your foreach statement. This is why you had to prefix the sqlQuery function with RODBC::. It is necessary to specify what packages the loop needs, because I think it essentially starts a new R session for each node, and each session needs to be initialized with the packages. Similarly, you can't access my_conn because it was created outside of the loop; you need to create it inside the loop so that every node has its own copy.
library(RODBC)
library(foreach)
library(doParallel)

setwd('C:/Users/x/Desktop')

doQueries <- function(filenameX) {
  sql.text <- sqlQuery(my_conn, 'SELECT * FROM table;')
  save(sql.text, file=filenameX)
}

cl <- makeCluster(2)
registerDoParallel(cl)
foreach (i=1:2, .packages='RODBC') %dopar% {
  my_conn <- odbcConnect("db", uid="user", pwd="pass")
  doQueries(filenameX=paste('file_', i, sep=''))
}
stopCluster(cl)
But, like I mentioned, this probably won't be any faster.
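Since you mentioned that parallel is replacing earlier packages: the same fan-out can also be written with base parallel alone, no foreach needed. The sketch below stubs out the database work so it runs anywhere; in a real version each worker would open its own channel once, as the comments indicate (the connection details there are placeholders):

```r
library(parallel)

# Stand-in for the real query function; the RODBC calls would go here.
doQueries <- function(i) {
  # sql.text <- sqlQuery(my_conn, 'SELECT * FROM table;')
  # save(sql.text, file=paste0('file_', i))
  paste0("file_", i)  # return the name of the file we would have saved
}

cl <- makeCluster(2)
# In the real version, each worker opens its own connection once, e.g.:
# clusterEvalQ(cl, { library(RODBC); my_conn <- odbcConnect("db", uid="user", pwd="pass") })
clusterExport(cl, "doQueries")
res <- parLapply(cl, 1:3, doQueries)
stopCluster(cl)
unlist(res)  # "file_1" "file_2" "file_3"
```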