I am trying to create an object tree from a large number of XML files. However, when I run the following code on about 2000 XML files (ranging from 100KB to 200MB each), I see a memory footprint of 8-9GB (note that I have commented out the code that creates the object tree). I expect the footprint to be minimal in this example because the code doesn't hold any references; it just creates each Elem and throws it away. Yet the heap usage stays the same even after running a full GC.
import java.io.{BufferedInputStream, File, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.xml.XML

// walk the directory and load every gzipped XML file
def addDir(dir: File) {
  dir.listFiles.filter(file => file.getName.endsWith("xml.gz")).foreach { gzipFile =>
    addGzipFile(gzipFile)
  }
}

def addGzipFile(gzipFile: File) {
  val is = new BufferedInputStream(new GZIPInputStream(new FileInputStream(gzipFile)))
  val xml = XML.load(is)
  // parse xml and create object tree
  is.close()
}
My JVM options are: -server -d64 -Xmx16G -Xss16M -XX:+DoEscapeAnalysis -XX:+UseCompressedOops
And the output of jmap -histo looks like this:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      67501390     1620033360  scala.collection.immutable.$colon$colon
   2:      37249187     1254400536  [C
   3:      37287806     1193209792  java.lang.String
   4:      37200976      595215616  scala.xml.Text
   5:      18600485      595215520  scala.xml.Elem
   6:       3420921       82102104  scala.Tuple2
   7:        213938       58213240  [I
   8:       1140334       36490688  scala.collection.mutable.ListBuffer
   9:       2280468       36487488  scala.runtime.ObjectRef
  10:       1140213       36486816  scala.collection.Iterator$$anon$24
  11:       1140210       36486720  scala.xml.parsing.FactoryAdapter$$anonfun$startElement$1
  12:       1140210       27365040  scala.collection.immutable.Range$$anon$2
 ...
Total      213412869     5693850736
I cannot reproduce this behavior. I use the following program:
import java.io._
import xml.XML

object XMLLoadHeap {
  val filename = "test.xml"

  def addFile() {
    val is = new BufferedInputStream(new FileInputStream(filename))
    val xml = XML.load(is)
    is.close()
    println(xml.label)
  }

  def createXMLFile() {
    val out = new FileWriter(filename)
    out.write("<foo>\n")
    (1 to 100000) foreach (i => out.write("  <bar baz=\"boom\"/>\n"))
    out.write("</foo>\n")
    out.close()
  }

  def main(args: Array[String]) {
    println("XMLLoadHeap")
    createXMLFile()
    (1 to args(0).toInt) foreach { i =>
      println("processing " + i)
      addFile()
    }
  }
}
I run it with the following options: -Xmx128m -XX:+HeapDumpOnOutOfMemoryError -verbose:gc
and it basically looks like it can run indefinitely.
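For reference, an invocation could look like this, assuming the scala launcher is on your PATH (the -J prefix passes flags through to the JVM; the iteration count is arbitrary):

scala -J-Xmx128m -J-XX:+HeapDumpOnOutOfMemoryError -J-verbose:gc XMLLoadHeap 1000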
You can try to see if it reproduces when loading only your largest XML file. It's possible the issue is not with processing many files, but just with processing the biggest one. When testing here with a dummy 200MB XML file on a 64-bit machine, I found I needed around 3GB of memory. If that's the case, you may need to use a pull parser instead of building the whole tree. See XMLEventReader.
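As a minimal sketch of the pull-parser approach: instead of materializing the whole document, XMLEventReader streams SAX-like events one at a time, so memory stays roughly proportional to a single event. The file name big.xml and the element name bar below are placeholders, not names from your data:

import scala.io.Source
import scala.xml.pull._

object PullParse {
  def main(args: Array[String]) {
    // the reader consumes the Source incrementally rather than
    // building a scala.xml.Elem tree for the entire document
    val reader = new XMLEventReader(Source.fromFile("big.xml"))
    var count = 0
    reader foreach {
      case EvElemStart(_, "bar", _, _) =>
        // handle one record at a time instead of keeping the tree around
        count += 1
      case _ => // ignore text, end tags, and other events
    }
    println("saw " + count + " <bar> elements")
  }
}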
Other than that, assuming you don't create the object tree, you can run with -Xmx4G -XX:+HeapDumpOnOutOfMemoryError
and then analyze the heap dump with a tool like MAT. 4GB should be sufficient to parse the largest XML file, and by the time you get an out-of-memory error there may be enough objects allocated to pinpoint which object is preventing GC. Most likely it will be an object holding on to the various parsed XML objects.
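If you'd rather not wait for an OutOfMemoryError, you can also take a dump of the running process with jmap and open that in MAT (the pid and output file name here are placeholders):

jmap -dump:live,format=b,file=heap.hprof <pid>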