I have an R server with 16 cores and 8Gb ram that initializes a local SNOW cluster of, say, 10 workers. Each worker downloads a series of datasets from a Microsoft SQL server, merges them on some key, then runs analyses on the dataset before writing the results to the SQL server. The connection between the workers and the SQL server runs through a RJDBC connection. When multiple workers are getting data from the SQL server, ram usage explodes and the R server crashes.
The strange thing is that the ram usage by a worker loading in data seems disproportionally large compared to the size of the loaded dataset. Each dataset has about 8000 rows and 6500 columns. This translates to about 20MB when saved as an R object on disk and about 160MB when saved as a comma-delimited file. Yet, the ram usage of the R session is about 2,3 GB.
Here is an overview of the code (some typographical changes to improve readability):
Establish connection using RJDBC:
require("RJDBC")
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<username>","<pass>")
After this there is some code that sorts the function input vector requestedDataSets with names of all tables to query by number of records, such that we load the datasets from largest to smallest:
nrow.to.merge <- rep(0, length(requestedDataSets))
for(d in 1:length(requestedDataSets)){
nrow.to.merge[d] <- dbGetQuery(con, paste0("select count(*) from",requestedDataSets[d]))[1,1]
}
merge.order <- order(nrow.to.merge,decreasing = T)
We then go through the requestedDatasets vector and load and/or merge the data:
for(d in merge.order){
# force reconnect to SQL server
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
try(dbDisconnect(con), silent = T)
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<user>","<pass>")
# remove the to.merge object
rm(complete.data.to.merge)
# force garbage collection
gc()
jgc()
# ask database for dataset d
complete.data.to.merge <- dbGetQuery(con, paste0("select * from",requestedDataSets[d]))
# first dataset
if (d == merge.order[1]){
complete.data <- complete.data.to.merge
colnames(complete.data)[colnames(complete.data) == "key"] <- "key_1"
}
# later dataset
else {
complete.data <- merge(
x = complete.data,
y = complete.data.to.merge,
by.x = "key_1", by.y = "key", all.x=T)
}
}
return(complete.data)
When I run this code on a serie of twelve datasets, the number of rows/columns of the complete.data object is as expected, so it is unlikely the merge call somehow blows up the usage. For the twelve iterations memory.size() returns 1178, 1364, 1500, 1662, 1656, 1925, 1835, 1987, 2106, 2130, 2217, and 2361. Which, again, is strange as the dataset at the end is at most 162 MB...
As you can see in the code above I've already tried a couple of fixes like calling GC(), JGC() (which is a function to force a Java garbage collection jgc <- function(){.jcall("java/lang/System", method = "gc")}). I've also tried merging the data SQL-server-side, but then I run into number of columns constraints.
It vexes me that the RAM usage is so much bigger than the dataset that is eventually created, leading me to believe there is some sort of buffer/heap that is overflowing... but I seem unable to find it.
Any advice on how to resolve this issue would be greatly appreciated. Let me know if (parts of) my problem description are vague or if you require more information.
Thanks.
This answer is more of a glorified comment. Simply because the data being processed on one node only requires 160MB does not mean that the amount of memory needed to process it is 160MB. Many algorithms require O(n^2)
storage space, which would be be in the GB for your chunk of data. So I actually don't see anything here which is unsurprising.
I've already tried a couple of fixes like calling GC(), JGC() (which is a function to force a Java garbage collection...
You can't force a garbage collection in Java, calling System.gc()
only politely asks the JVM to do a garbage collection, but it is free to ignore the request if it wants. In any case, the JVM usually optimizes garbage collection well on its own, and I doubt this is your bottleneck. More likely, you are simply hitting on the overhead which R needs to crunch your data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With