I have six SQL queries that I script though R that each take a very long time (~30 minutes each). Once each query returns I then manipulate the data for some standard reports.
What I'd like to do is use my multicore machine to run these SQL requests in parallel from R.
I'm on a Windows machine with a Oracle DB. I was following a blog post to use doSNOW and foreach to try and split these requests and this is the best thing I can find on stackoverflow.
I've been able to get the process to work for the non-parallel %do% version of foreach but not the %dopar%. With %dopar% it just returns an empty set. Below is code that sets up tables and runs the queries so you can see what happens. Apologies in advance if there's too much basic code.
I've looked at some of the other R packages but didn't see an obvious solution. Also if you have a better way to manage this kind of process I'd be interested to hear it - just keep in mind I'm an analyst not a computer scientist. Thanks!
#Creating a cluster
library(doSNOW)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl)
#Connecting to database through RODBC
ch=odbcConnect("",pwd = "xxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)
#Creating database tables for example purposes
qryA1 <- "create table temptable(test int)"
qryA2 <- "insert into temptable(test) values((1))"
qryA3 <- "select * from temptable"
qryA4 <- "drop table temptable"
qryB1 <- "create table temptable2(test int)"
qryB2 <- "insert into temptable2(test) values((2))"
qryB3 <- "select * from temptable2"
qryB4 <- "drop table temptable2"
sqlQuery(ch, qryA1)
sqlQuery(ch, qryA2)
doesItWork <- sqlQuery(ch, qryA3)
doesItWork
sqlQuery(ch, qryB1)
sqlQuery(ch, qryB2)
doesItWork <- sqlQuery(ch, qryB3)
doesItWork
result = c()
output = c()
databases <- list('temptable','temptable2')
#Non-parallel version of foreach
system.time(
foreach(i = 1:2)%do%{
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))
output[i] = result
}
)
output
#Parallel version of foreach
outputPar = c()
system.time(
foreach(i = 1:2)%dopar%{
#Connecting to database through RODBC
ch=odbcConnect(dsn ,pwd = "xxxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))
outputPar[i] = result
}
)
outputPar
sqlQuery(ch, qryA4)
sqlQuery(ch, qryB4)
Did you know that you can run SQL code in an R Notebook code chunk? To use SQL, open an R Notebook in the RStudio IDE under the File > New File menu. Start a new code chunk with {sql} , and specify your connection with the connection=con code chunk option.
A 'regressed query' is one that received a worse Execution Plan then it had before. This report shows all the queries that have had their Execution Plan regressed over time.
Not only can you easily retrieve data from SQL Sources for analysis and visualisation in R, but you can also use SQL to create, clean, filter, query and otherwise manipulate datasets within R, using a wide choice of relational databases.
The sqldf package (CRAN, GitHub) runs SQL queries on R data frames by transparently setting up a database, loading data from R data frames into the database, running SQL queries in the database, and returning results as R data frames.
When you make the assignment outputPar[i] = result
inside the serial foreach loop, this is OK (but not really the intended use of foreach). When you make this assignment in the parallel loop, it is not OK. See http://tolstoy.newcastle.edu.au/R/e10/help/10/04/3237.html for a similar question answered by David Smith at Revolution.
As a solution,
system.time(
outputPar <- foreach(i = 1:2, .packages="RODBC")%dopar%{
#Connecting to database through RODBC
ch=odbcConnect(dsn ,pwd = "xxxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))
result
}
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With