Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parallelizing SQL queries in R

I have six SQL queries that I script though R that each take a very long time (~30 minutes each). Once each query returns I then manipulate the data for some standard reports.

What I'd like to do is use my multicore machine to run these SQL requests in parallel from R.

I'm on a Windows machine with a Oracle DB. I was following a blog post to use doSNOW and foreach to try and split these requests and this is the best thing I can find on stackoverflow.

I've been able to get the process to work for the non-parallel %do% version of foreach but not the %dopar%. With %dopar% it just returns an empty set. Below is code that sets up tables and runs the queries so you can see what happens. Apologies in advance if there's too much basic code.

I've looked at some of the other R packages but didn't see an obvious solution. Also if you have a better way to manage this kind of process I'd be interested to hear it - just keep in mind I'm an analyst not a computer scientist. Thanks!

#Creating a cluster
library(doSNOW)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl)

#Connecting to database through RODBC
ch=odbcConnect("",pwd = "xxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)

#Creating database tables for example purposes
qryA1 <- "create table temptable(test int)" 
qryA2 <- "insert into temptable(test) values((1))" 
qryA3 <- "select * from temptable" 
qryA4 <- "drop table temptable" 
qryB1 <- "create table temptable2(test int)" 
qryB2 <- "insert into temptable2(test) values((2))" 
qryB3 <- "select * from temptable2" 
qryB4 <- "drop table temptable2"  

sqlQuery(ch, qryA1) 
sqlQuery(ch, qryA2) 
doesItWork <- sqlQuery(ch, qryA3) 
doesItWork
sqlQuery(ch, qryB1) 
sqlQuery(ch, qryB2) 
doesItWork <- sqlQuery(ch, qryB3) 
doesItWork

result = c()
output = c()
databases <- list('temptable','temptable2')


#Non-parallel version of foreach
system.time(
foreach(i = 1:2)%do%{
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
output[i] = result
}
) 

output

#Parallel version of foreach

outputPar = c()

system.time(
foreach(i = 1:2)%dopar%{
#Connecting to database through RODBC
ch=odbcConnect(dsn ,pwd = "xxxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
outputPar[i] = result
}
) 

outputPar

sqlQuery(ch, qryA4)
sqlQuery(ch, qryB4) 
like image 724
Andrew Elliott Avatar asked Mar 29 '12 19:03

Andrew Elliott


People also ask

Can we write SQL query in R?

Did you know that you can run SQL code in an R Notebook code chunk? To use SQL, open an R Notebook in the RStudio IDE under the File > New File menu. Start a new code chunk with {sql} , and specify your connection with the connection=con code chunk option.

What is regressed queries in query store?

A 'regressed query' is one that received a worse Execution Plan then it had before. This report shows all the queries that have had their Execution Plan regressed over time.

Can you use R and SQL together?

Not only can you easily retrieve data from SQL Sources for analysis and visualisation in R, but you can also use SQL to create, clean, filter, query and otherwise manipulate datasets within R, using a wide choice of relational databases.

Which package enable the SQL queries in the R?

The sqldf package (CRAN, GitHub) runs SQL queries on R data frames by transparently setting up a database, loading data from R data frames into the database, running SQL queries in the database, and returning results as R data frames.


1 Answers

When you make the assignment outputPar[i] = result inside the serial foreach loop, this is OK (but not really the intended use of foreach). When you make this assignment in the parallel loop, it is not OK. See http://tolstoy.newcastle.edu.au/R/e10/help/10/04/3237.html for a similar question answered by David Smith at Revolution.

As a solution,

system.time(
  outputPar <- foreach(i = 1:2, .packages="RODBC")%dopar%{
#Connecting to database through RODBC
  ch=odbcConnect(dsn ,pwd = "xxxxxx", believeNRows=FALSE)
#Test connection
  odbcGetInfo(ch)
  result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
  result
}
)
like image 72
BenBarnes Avatar answered Sep 20 '22 23:09

BenBarnes