 

Reducing memory footprint while using large XML DOMs in Java

Our application is required to take client data presented in XML format (several files) and parse it into our common XML format (a single file with a schema). For this purpose we are using Apache's XMLBeans data binding framework. The steps of this process are briefly described below.

First, we take raw java.io.File objects pointing to the client XML files on disk and load these into a collection. We then iterate over this collection, creating a single org.apache.xmlbeans.XmlObject per file. After all files have been parsed into XmlObjects, we create 4 collections holding the individual objects from the XML documents that we are interested in (to be clear, these are not hand-crafted objects but what I can only describe as 'proxy' objects created by Apache's XMLBeans framework). As a final step, we then iterate over these collections to produce our XML document (in memory) and then save this to disk.
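For reference, a minimal sketch of the loading step as described above (the surrounding class is hypothetical; `XmlObject.Factory.parse(File)` is the standard XMLBeans entry point):

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.xmlbeans.XmlException;
import org.apache.xmlbeans.XmlObject;

public final class ClientXmlLoader {
    // Parse each client file into an XmlObject. Note that XMLBeans keeps
    // the entire token store for every document in memory for as long as
    // the returned list (or any proxy derived from it) is referenced.
    static List<XmlObject> loadAll(List<File> clientFiles)
            throws XmlException, IOException {
        List<XmlObject> documents = new ArrayList<XmlObject>(clientFiles.size());
        for (File file : clientFiles) {
            documents.add(XmlObject.Factory.parse(file));
        }
        return documents;
    }
}
```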

For the majority of use cases, this process works fine and can easily run in the JVM when given the '-Xmx1500m' command-line argument. However, issues arise when we are given 'large datasets' by the client. Large in this instance is 123 MB of client XML spread over 7 files. Such datasets result in our in-code collections being populated with almost 40,000 of the aforementioned 'proxy objects'. In these cases the memory usage just goes through the roof. I do not get any OutOfMemoryErrors; the program just hangs until garbage collection occurs, freeing up a small amount of memory; the program then continues, uses up this new space, and the cycle repeats. These parsing sessions currently take 4-5 hours. We are aiming to bring this down to within an hour.

It's important to note that the calculations required to transform the client XML into our XML require all of the XML data to be available for cross-referencing. Therefore we cannot implement a sequential parsing model or batch this process into smaller blocks.

What I've tried so far

Instead of holding all 123 MB of client XML in memory, I tried loading the files on each request for data, finding the data, and then releasing the references to these objects. This does seem to reduce the amount of memory consumed during the process but, as you can imagine, the time the constant I/O takes removes the benefit of the reduced memory footprint.
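For illustration, a sketch of this load-query-release variant (the XPath string and helper are hypothetical). The important detail is copying plain values out of the document before the method returns, since any XmlObject proxy that escapes would keep the whole token store alive:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.xmlbeans.XmlObject;

public final class OnDemandQuery {
    // Re-parse the file on every request, extract plain strings, and let
    // the document (and its token store) become garbage on return.
    static List<String> query(File file, String xpath) throws Exception {
        XmlObject doc = XmlObject.Factory.parse(file); // full parse each time
        List<String> values = new ArrayList<String>();
        for (XmlObject hit : doc.selectPath(xpath)) {
            values.add(hit.xmlText()); // copy out; don't return the proxy
        }
        return values; // 'doc' is unreachable after this point
    }
}
```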

I suspected an issue was that we are holding an XmlObject[] for 123 MB worth of XML files as well as the collections of objects taken from these documents (using XPath queries). To remedy this, I altered the logic so that instead of querying these collections, the documents were queried directly. The idea here being that at no point do there exist 4 massive Lists with tens of thousands of objects in them, just the large collection of XmlObjects. This did not seem to make a difference at all and in some cases increased the memory footprint even more.
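A sketch of what "querying the documents directly" might look like, with a hypothetical namespace and element name. `selectPath` returns proxies backed by the document's existing token store rather than copies, which may explain why this changed little: the proxies themselves are small, but they pin the underlying store in memory for as long as they are referenced:

```java
import org.apache.xmlbeans.XmlObject;

public final class DirectQuery {
    // Query a parsed document on demand instead of pre-building four big
    // Lists of proxies. The returned proxies are views over the document's
    // token store, so they add little on their own.
    static XmlObject[] findRecords(XmlObject document) {
        return document.selectPath(
            "declare namespace c='http://example.com/client'; $this//c:Record");
    }
}
```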

Clutching at straws now, I considered that the XmlObject we use to build our XML in memory before writing to disk was growing too large to maintain alongside all the client data. However, some sizeOf queries on this object revealed that at its largest it is less than 10 KB. After reading up on how XMLBeans manages large DOM objects, it seems to use some form of buffered writer and, as a result, is managing this object quite well.

So now I am out of ideas. I can't use SAX approaches instead of memory-intensive DOM approaches, as we need 100% of the client data in our app at any one time. I cannot hold off requesting this data until we absolutely need it, as the conversion process requires a lot of looping and the disk I/O time is not worth the saved memory space. And I cannot seem to structure our logic in such a way as to reduce the amount of space the internal Java collections occupy. Am I out of luck here? Must I just accept that if I want to parse 123 MB worth of XML data into our XML format, I cannot do it with the 1500m memory allocation? While 123 MB is a large dataset in our domain, I cannot imagine others have never had to do something similar with GBs of data at a time.

Other information that may be important

  • I have used JProbe to try and see if that can tell me anything useful. While I am a profiling noob, I ran through their tutorials for memory leaks and thread locks, understood them, and there doesn't appear to be any leak or bottleneck in our code. After running the application with a large dataset, we quickly see a 'sawblade' type shape on the memory analysis screen (see attached image), with PS Eden Space churning rapidly while PS Old Gen grows into a massive green block. This leads me to believe that the issue here is simply the sheer amount of space taken up by object collections rather than a leak holding onto unused memory.

[Image: JProbe trace of memory usage during parsing of a large dataset]

  • I am running on a 64-bit Windows 7 platform, but this will need to run in a 32-bit environment.
asked Dec 12 '11 by user407356



1 Answer

The approach I'd take would be to make two passes over the files, using SAX in both cases.

The first pass would parse the 'cross-reference' data needed in the calculations into custom objects and store them in Maps. If the 'cross-reference' data is large, then look at using a distributed cache (Coherence is the natural fit if you've started with Maps).

The second pass would parse the files, retrieve the 'cross-reference' data to perform calculations as needed, and then write the output XML using the javax.xml.stream APIs.
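A minimal sketch of this two-pass shape, with hypothetical element and attribute names. To keep the example short, the second pass here streams the collected data directly rather than re-parsing the files and running calculations:

```java
import java.io.File;
import java.io.FileWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public final class TwoPassConverter {

    // Pass 1: harvest only the cross-reference fields into a Map keyed by
    // id, instead of holding whole documents in memory.
    static Map<String, String> buildCrossRefs(File[] files) throws Exception {
        final Map<String, String> refs = new HashMap<String, String>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if ("record".equals(qName)) {
                    refs.put(atts.getValue("id"), atts.getValue("refValue"));
                }
            }
        };
        for (File f : files) {
            parser.parse(f, handler);
        }
        return refs;
    }

    // Pass 2: stream the output document with javax.xml.stream, so only
    // the element currently being written is held in memory.
    static void writeOutput(File out, Map<String, String> refs) throws Exception {
        XMLStreamWriter w = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new FileWriter(out));
        w.writeStartDocument();
        w.writeStartElement("commonFormat");
        for (Map.Entry<String, String> e : refs.entrySet()) {
            w.writeStartElement("record");
            w.writeAttribute("id", e.getKey());
            w.writeCharacters(e.getValue()); // calculated value would go here
            w.writeEndElement();
        }
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
    }
}
```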

answered Sep 28 '22 by Nick Holt