R does not stop grabbing memory / RAM due to XML

Tags: loops, r, xml

I have a double loop like the one shown below. The problem is that R (2.15.2) uses more and more memory and I do not understand why.

I understand that memory has to grow within the inner loop because of the rbind() I am doing there. What I do not understand is why R keeps grabbing memory when a new cycle of the outer loop starts, even though the objects (such as 'xmlCatcher') are reused:

# !!!BEWARE this example creates a lot of files (n=1000)!!!!

require(XML)

chunk <- function(x, chunksize){
        # source: http://stackoverflow.com/a/3321659/1144966
        x2 <- seq_along(x)
        split(x, ceiling(x2/chunksize))
    }

chunky <- chunk(paste("test",1:1000,".xml",sep=""),100)

for(i in 1:1000){
    writeLines(
        c(paste('<?xml version="1.0"?>\n <note>\n    <to>Tove</to>\n    <nr>',i,'</nr>\n    <from>Jani</from>\n    <heading>Reminder</heading>\n    ',sep=""),
          paste(rep('<body>Do not forget me this weekend!</body>\n',sample(1:10, 1)),sep=""),
          ' </note>'),
        paste("test",i,".xml",sep=""))
}

for(k in 1:length(chunky)){
    gc()
    print(chunky[[k]])
    xmlCatcher <- NULL

    for(i in 1:length(chunky[[k]])){
        filename    <- chunky[[k]][i]
        xml         <- xmlTreeParse(filename)
        xml         <- xmlRoot(xml)
        result      <- sapply(getNodeSet(xml,"//body"), xmlValue)
        id          <- sapply(getNodeSet(xml,"//nr"), xmlValue)
        dummy       <- cbind(id,result)
        xmlCatcher  <- rbind(xmlCatcher,dummy)
    }
    save(xmlCatcher,file=paste("xmlCatcher",k,".RData"))
}

Does somebody have an idea why this behaviour might occur? Note that all the objects (like 'xmlCatcher') are reused every cycle so that I would assume that the RAM used should stay about the same after the first 'chunk' cycle.

  • Garbage collection does not change a thing.
  • Not using rbind does not change a thing.
  • Using fewer XML functions actually results in less memory grabbing, but why?

Is this a bug or am I missing something?

petermeissner asked Dec 11 '22 19:12


2 Answers

Your understanding of reusing memory is wrong.

When you create the new DummyCatcher, the old one is no longer referenced and becomes a candidate for garbage collection, which will happen at some point.

You are not reusing memory; you are creating a new object and abandoning the old one.

Garbage collection will free the memory.

Also, I suggest you look at Rprofmem to profile your memory use.
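
As a rough illustration of the point about abandoned objects, the sketch below uses only base R's gc() to watch the vector heap while a name is rebound. The helper name vcells_used and the vector size of 1e7 are assumptions chosen for this example and are not part of the question's code.

```r
# Hypothetical helper: force a collection and report vector cells in use.
vcells_used <- function() gc()["Vcells", "used"]

before <- vcells_used()

big <- numeric(1e7)        # roughly 80 MB of doubles
during <- vcells_used()    # 'big' is still referenced, so it is counted

big <- NULL                # rebinding the name abandons the old vector
after <- vcells_used()     # the abandoned vector has been collected by now

print(c(before = before, during = during, after = after))
```

If a comparable measurement in a real session shows usage staying high after the name has been rebound, that would suggest something other than the R-level variable is still holding the memory.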

Romain Francois answered Dec 20 '22 11:12


Chapter 2 of this talks about rbind as a common means of being a glutton.

You can avoid using rbind inside the loop by filling a preallocated list and binding once at the end:

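# assumes 'chunky' and 'k' from the question; n is the number of files in chunk k
n <- length(chunky[[k]])
my.list <- vector('list', n)      # preallocate one slot per file
dummy <- 0                        # placeholder for the per-file result
for(i in seq_len(n)) {
   dummy <- dummy + 1
   my.list[[i]] <- data.frame(dummy)
}
DummyCatcher  <- do.call('rbind', my.list)   # bind once, after the loop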
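
To show how this pattern might map onto the question's inner loop, here is a minimal sketch. The names collect_chunk and fake_parse are hypothetical. fake_parse is a stand-in that only mimics the shape of the question's cbind(id, result) output, so the example runs without the XML package or any files on disk. In the question, the parsing function would wrap the xmlTreeParse(), xmlRoot(), getNodeSet() and xmlValue() calls instead. This only avoids the repeated copying done by rbind; whether it changes the memory held by the XML functions themselves is not something this sketch demonstrates.

```r
# Hypothetical helper: store each file's result in a preallocated list,
# then bind all pieces together once.
collect_chunk <- function(files, parse_one) {
  pieces <- vector("list", length(files))
  for (i in seq_along(files)) {
    pieces[[i]] <- parse_one(files[i])
  }
  do.call(rbind, pieces)
}

# Stand-in parser (assumption): returns a character matrix with columns
# 'id' and 'result', one row per pretend <body> element.
fake_parse <- function(filename) {
  n_body <- as.integer(gsub("[^0-9]", "", filename))
  cbind(id = rep(filename, n_body),
        result = paste("body", seq_len(n_body)))
}

files <- paste("test", 1:5, ".xml", sep = "")
xmlCatcher <- collect_chunk(files, fake_parse)
print(dim(xmlCatcher))
```

Passing the parser in as an argument keeps the accumulation logic separate from the parsing, which makes it easier to test each part on its own.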
agstudy answered Dec 20 '22 12:12