Scala - High heap usage when performing XML.loadFile on a large number of files in local scope

I am trying to create an object tree from a large number of XML files. However, when I run the following code on about 2000 XML files (ranging from 100KB to 200MB each), I see a memory footprint of 8-9GB (note that I have commented out the code that creates the object tree). I expect the footprint to be minimal in this example, because the code doesn't hold any references: it just creates each Elem and throws it away. The heap usage stays the same even after running a full GC.

import java.io._
import java.util.zip.GZIPInputStream
import scala.xml.XML

def addDir(dir: File) {
  dir.listFiles.filter(file => file.getName.endsWith("xml.gz")).foreach { gzipFile =>
    addGzipFile(gzipFile)
  }
}

def addGzipFile(gzipFile: File) {
  val is = new BufferedInputStream(new GZIPInputStream(new FileInputStream(gzipFile)))
  val xml = XML.load(is)
  // parse xml and create object tree (commented out for this test)
  is.close()
}

My JVM options are: -server -d64 -Xmx16G -Xss16M -XX:+DoEscapeAnalysis -XX:+UseCompressedOops

And the output of jmap -histo looks like this

num     #instances         #bytes  class name
----------------------------------------------
   1:      67501390     1620033360  scala.collection.immutable.$colon$colon
   2:      37249187     1254400536  [C
   3:      37287806     1193209792  java.lang.String
   4:      37200976      595215616  scala.xml.Text
   5:      18600485      595215520  scala.xml.Elem
   6:       3420921       82102104  scala.Tuple2
   7:        213938       58213240  [I
   8:       1140334       36490688  scala.collection.mutable.ListBuffer
   9:       2280468       36487488  scala.runtime.ObjectRef
  10:       1140213       36486816  scala.collection.Iterator$$anon$24
  11:       1140210       36486720  scala.xml.parsing.FactoryAdapter$$anonfun$startElement$1
  12:       1140210       27365040  scala.collection.immutable.Range$$anon$2
...
Total     213412869     5693850736
asked Nov 05 '22 by Sachin Kanekar


1 Answer

I cannot reproduce this behavior. I use the following program:

import java.io._
import xml.XML

object XMLLoadHeap {

  val filename = "test.xml"

  def addFile() {
    val is = new BufferedInputStream(new FileInputStream(filename))
    val xml = XML.load(is)
    is.close()
    println(xml.label)
  }

  def createXMLFile() {
    val out = new FileWriter(filename)
    out.write("<foo>\n")
    (1 to 100000) foreach (i => out.write("  <bar baz=\"boom\"/>\n"))
    out.write("</foo>\n")
    out.close()
  }

  def main(args:Array[String]) {
    println("XMLLoadHeap")
    createXMLFile()
    (1 to args(0).toInt) foreach { i => 
      println("processing " + i)
      addFile()
    }
  }

}

I run it with the following options: -Xmx128m -XX:+HeapDumpOnOutOfMemoryError -verbose:gc and it basically looks like it can run indefinitely.
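For reference, a run from the command line might look like this (a sketch; the repeat count of 1000 is arbitrary, and -J passes options through the scala launcher to the JVM):

scalac XMLLoadHeap.scala
scala -J-Xmx128m -J-verbose:gc -J-XX:+HeapDumpOnOutOfMemoryError XMLLoadHeap 1000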

You can try to see if this happens with only your largest XML file. It's possible the issue is not processing many files, but simply processing the biggest one. When testing here with a dummy 200MB XML file on a 64-bit machine, I see that I need around 3GB of memory. If so, you may need to use a pull parser instead. See XMLEventReader.
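For example, here is a minimal streaming sketch under the asker's setup (assumptions: gzipped input files as in the question, and a placeholder element name "record" standing in for whatever you actually extract):

import java.io._
import java.util.zip.GZIPInputStream
import scala.io.Source
import scala.xml.pull._

def streamGzipFile(gzipFile: File) {
  // Pull events one at a time instead of materializing the whole Elem tree
  val source = Source.fromInputStream(
    new GZIPInputStream(new BufferedInputStream(new FileInputStream(gzipFile))))
  val reader = new XMLEventReader(source)
  try {
    reader foreach {
      case EvElemStart(_, "record", attrs, _) =>
        // "record" is a hypothetical element name; handle its attributes here
        println(attrs)
      case _ => // ignore text nodes, end tags, etc.
    }
  } finally {
    reader.stop()  // shut down the reader's producer thread
    source.close()
  }
}

This way only the current event is live at any given time, so heap usage stays roughly flat regardless of file size.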

Other than that, assuming you don't create the object tree, you can use -Xmx4G -XX:+HeapDumpOnOutOfMemoryError and then analyze the heap dump with a tool like MAT. 4GB should be sufficient to parse the largest XML file, and by the time you get an out-of-memory error there may be enough objects allocated to pinpoint which object is preventing GC. Most likely it will be an object holding on to the various parsed XML nodes.
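If you don't want to wait for the out-of-memory error, you can also take a heap dump of the running JVM manually and open it in MAT, for example (with <pid> standing in for the process id):

jmap -dump:live,format=b,file=heap.hprof <pid>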

answered Nov 15 '22 by huynhjl