 

Server out-of-memory issue when using RJDBC in a parallel computing environment

I have an R server with 16 cores and 8 GB of RAM that initializes a local SNOW cluster of, say, 10 workers. Each worker downloads a series of datasets from a Microsoft SQL Server, merges them on some key, then runs analyses on the merged dataset before writing the results back to the SQL server. The connection between the workers and the SQL server runs through an RJDBC connection. When multiple workers are getting data from the SQL server at the same time, RAM usage explodes and the R server crashes.
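(For context, the worker setup looks roughly like the sketch below; worker_job() is a hypothetical placeholder for the download/merge/analyse routine described above, not the actual code.)

# a local SNOW/PSOCK cluster of 10 workers, each handling one job
library(parallel)
cl <- makeCluster(10, type = "PSOCK")
results <- parLapply(cl, seq_len(10), function(i) worker_job(i))
stopCluster(cl)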

The strange thing is that the RAM usage of a worker loading data seems disproportionately large compared to the size of the loaded dataset. Each dataset has about 8000 rows and 6500 columns. This translates to about 20 MB when saved as an R object on disk and about 160 MB when saved as a comma-delimited file. Yet the RAM usage of the R session is about 2.3 GB.
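(These size figures can be checked from within R; a minimal sketch, where dat is a hypothetical name standing for one downloaded dataset:)

# in-R size of a single downloaded dataset
format(object.size(dat), units = "MB")
# total memory currently used by the R session, per the "(Mb)" columns of gc()
gc()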

Here is an overview of the code (some typographical changes to improve readability):

Establish connection using RJDBC:

require("RJDBC")
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<username>","<pass>")
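For reference (this is not part of the original code): rJava only reads JVM options when the JVM starts, so a heap limit such as -Xmx has to be set before library(RJDBC) or the first JDBC() call. A minimal sketch, with "-Xmx4g" as an arbitrary example value:

# must run before RJDBC/rJava starts the JVM; "-Xmx4g" is just an example
options(java.parameters = "-Xmx4g")
library(RJDBC)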

After this comes some code that sorts the function input vector requestedDataSets, which holds the names of all tables to query, by number of records, so that the datasets are loaded from largest to smallest:

# get the row count of each requested table
nrow.to.merge <- rep(0, length(requestedDataSets))
for (d in 1:length(requestedDataSets)) {
  nrow.to.merge[d] <- dbGetQuery(con, paste0("select count(*) from ", requestedDataSets[d]))[1, 1]
}
# process the tables from largest to smallest
merge.order <- order(nrow.to.merge, decreasing = TRUE)

We then go through the requestedDataSets vector and load and/or merge the data:

for (d in merge.order) {
    # force a reconnect to the SQL server
    drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver", "sqljdbc4.jar")
    try(dbDisconnect(con), silent = TRUE)
    con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>", "<user>", "<pass>")
    # remove the to.merge object (if it exists yet)
    if (exists("complete.data.to.merge")) rm(complete.data.to.merge)
    # force garbage collection in R and in the JVM
    gc()
    jgc()
    # ask the database for dataset d
    complete.data.to.merge <- dbGetQuery(con, paste0("select * from ", requestedDataSets[d]))
    if (d == merge.order[1]) {
        # first dataset: keep it as-is and rename the key column
        complete.data <- complete.data.to.merge
        colnames(complete.data)[colnames(complete.data) == "key"] <- "key_1"
    } else {
        # later datasets: left-join onto the data collected so far
        complete.data <- merge(
            x = complete.data,
            y = complete.data.to.merge,
            by.x = "key_1", by.y = "key", all.x = TRUE)
    }
}
return(complete.data)
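Not part of the original code, but for reference: RJDBC also exposes dbSendQuery() and fetch(), so a table can be pulled in chunks rather than in one dbGetQuery() call. A minimal sketch; the chunk size and the rbind-based accumulation are illustrative only, and the full table still ends up in R at the end:

# pull one table in chunks of at most 5000 rows
res    <- dbSendQuery(con, paste0("select * from ", requestedDataSets[d]))
chunks <- list()
repeat {
  chunk <- fetch(res, n = 5000)            # at most 5000 rows per fetch
  if (nrow(chunk) == 0) break              # no rows left
  chunks[[length(chunks) + 1]] <- chunk
}
dbClearResult(res)
complete.data.to.merge <- do.call(rbind, chunks)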

When I run the original code above on a series of twelve datasets, the number of rows/columns of the complete.data object is as expected, so it is unlikely that the merge call somehow blows up the usage. Over the twelve iterations memory.size() returns 1178, 1364, 1500, 1662, 1656, 1925, 1835, 1987, 2106, 2130, 2217, and 2361. Which, again, is strange, as the dataset at the end is at most 162 MB...

As you can see in the code above, I've already tried a couple of fixes, like calling gc() and jgc() (the latter being a small helper that asks for a Java garbage collection: jgc <- function() .jcall("java/lang/System", method = "gc")). I've also tried merging the data SQL-server-side, but then I run into column-count constraints.
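For reference, the jgc() helper mentioned above written out as code, with its rJava dependency:

# helper from the post: ask the JVM to run a garbage collection via rJava
library(rJava)
jgc <- function() {
  .jcall("java/lang/System", method = "gc")
}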

It vexes me that the RAM usage is so much bigger than the dataset that is eventually created, leading me to believe there is some sort of buffer/heap that is overflowing... but I seem unable to find it.

Any advice on how to resolve this issue would be greatly appreciated. Let me know if (parts of) my problem description are vague or if you require more information.

Thanks.

Predictor2016 asked Oct 18 '22 at 22:10


1 Answer

This answer is more of a glorified comment. Simply because the data being processed on one node only requires 160 MB does not mean that the amount of memory needed to process it is 160 MB. Many algorithms require O(n^2) storage space, which would be in the GB range for your chunk of data. So I actually don't see anything surprising here.

I've already tried a couple of fixes, like calling gc() and jgc() (the latter being a small helper that asks for a Java garbage collection...

You can't force a garbage collection in Java; calling System.gc() only politely asks the JVM to do a garbage collection, and it is free to ignore the request if it wants. In any case, the JVM usually optimizes garbage collection well on its own, and I doubt this is your bottleneck. More likely, you are simply hitting the overhead that R needs to crunch your data.
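(As a side note, the JVM heap can be inspected from R via rJava, which makes it possible to see whether the Java side or the R side is holding the memory; a minimal sketch:)

# inspect JVM heap usage from R (the JVM is already running once RJDBC is loaded)
library(rJava)
.jinit()
rt <- .jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
jvm_used_mb <- (.jcall(rt, "J", "totalMemory") - .jcall(rt, "J", "freeMemory")) / 1024^2
jvm_used_mb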

Tim Biegeleisen answered Oct 20 '22 at 21:10