 

Reducing memory footprint while using large XML DOMs in Java

Our application is required to take client data presented in XML format (several files) and parse it into our common XML format (a single file with a schema). For this purpose we are using Apache's XMLBeans data binding framework. The steps of this process are briefly described below.

First, we take raw java.io.File objects pointing to the client XML files on disk and load these into a collection. We then iterate over this collection, creating a single org.apache.xmlbeans.XmlObject per file. After all files have been parsed into XmlObjects, we create 4 collections holding the individual objects from the XML documents that we are interested in (to be clear, these are not hand-crafted objects but what I can only describe as 'proxy' objects created by Apache's XMLBeans framework). As a final step, we then iterate over these collections to produce our XML document (in memory) and then save this to disk.
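For reference, a minimal sketch of the loading step as described above (the surrounding class is hypothetical; `XmlObject.Factory.parse(File)` is the standard XMLBeans entry point):

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.xmlbeans.XmlException;
import org.apache.xmlbeans.XmlObject;

public final class ClientXmlLoader {
    // Parse each client file into an XmlObject. Note that XMLBeans keeps
    // the entire token store for every document in memory for as long as
    // the returned list (or any proxy derived from it) is referenced.
    static List<XmlObject> loadAll(List<File> clientFiles)
            throws XmlException, IOException {
        List<XmlObject> documents = new ArrayList<XmlObject>(clientFiles.size());
        for (File file : clientFiles) {
            documents.add(XmlObject.Factory.parse(file));
        }
        return documents;
    }
}
```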

For the majority of use cases, this process works fine and can easily run in the JVM when given the '-Xmx1500m' command-line argument. However, issues arise when we are given 'large datasets' by the client. Large in this instance is 123 MB of client XML spread over 7 files. Such datasets result in our in-code collections being populated with almost 40,000 of the aforementioned 'proxy objects'. In these cases the memory usage just goes through the roof. I do not get any OutOfMemoryErrors; the program just hangs until garbage collection occurs, freeing up a small amount of memory; the program then continues, uses up this new space, and the cycle repeats. These parsing sessions currently take 4-5 hours. We are aiming to bring this down to within an hour.

It's important to note that the calculations required to transform the client XML into our XML require all of the XML data to be available for cross-referencing. Therefore we cannot implement a sequential parsing model or batch this process into smaller blocks.

What I've tried so far

Instead of holding all 123 MB of client XML in memory, I tried loading the files on each request for data, finding the data, and then releasing the references to these objects. This does seem to reduce the amount of memory consumed during the process but, as you can imagine, the time the constant I/O takes removes the benefit of the reduced memory footprint.
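For illustration, a sketch of this load-query-release variant (the XPath string and helper are hypothetical). The important detail is copying plain values out of the document before the method returns, since any XmlObject proxy that escapes would keep the whole token store alive:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.xmlbeans.XmlObject;

public final class OnDemandQuery {
    // Re-parse the file on every request, extract plain strings, and let
    // the document (and its token store) become garbage on return.
    static List<String> query(File file, String xpath) throws Exception {
        XmlObject doc = XmlObject.Factory.parse(file); // full parse each time
        List<String> values = new ArrayList<String>();
        for (XmlObject hit : doc.selectPath(xpath)) {
            values.add(hit.xmlText()); // copy out; don't return the proxy
        }
        return values; // 'doc' is unreachable after this point
    }
}
```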

I suspected an issue was that we are holding an XmlObject[] for 123 MB worth of XML files as well as the collections of objects taken from these documents (using XPath queries). To remedy this, I altered the logic so that instead of querying these collections, the documents were queried directly. The idea here being that at no point do there exist 4 massive Lists with tens of thousands of objects in them, just the large collection of XmlObjects. This did not seem to make a difference at all and in some cases increased the memory footprint even more.
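A sketch of what "querying the documents directly" might look like, with a hypothetical namespace and element name. `selectPath` returns proxies backed by the document's existing token store rather than copies, which may explain why this changed little: the proxies themselves are small, but they pin the underlying store in memory for as long as they are referenced:

```java
import org.apache.xmlbeans.XmlObject;

public final class DirectQuery {
    // Query a parsed document on demand instead of pre-building four big
    // Lists of proxies. The returned proxies are views over the document's
    // token store, so they add little on their own.
    static XmlObject[] findRecords(XmlObject document) {
        return document.selectPath(
            "declare namespace c='http://example.com/client'; $this//c:Record");
    }
}
```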

Clutching at straws now, I considered that the XmlObject we use to build our XML in memory before writing to disk was growing too large to maintain alongside all the client data. However, some sizeOf queries on this object revealed that at its largest it is less than 10 KB. After reading up on how XMLBeans manages large DOM objects, it seems to use some form of buffered writer and, as a result, is managing this object quite well.

So now I am out of ideas. I can't use SAX approaches instead of memory-intensive DOM approaches, as we need 100% of the client data in our app at any one time. I cannot hold off requesting this data until we absolutely need it, as the conversion process requires a lot of looping and the disk I/O time is not worth the saved memory space. And I cannot seem to structure our logic in such a way as to reduce the amount of space the internal Java collections occupy. Am I out of luck here? Must I just accept that if I want to parse 123 MB worth of XML data into our XML format, I cannot do it with the 1500m memory allocation? While 123 MB is a large dataset in our domain, I cannot imagine others have never had to do something similar with GBs of data at a time.

Other information that may be important

  • I have used JProbe to try and see if that can tell me anything useful. While I am a profiling noob, I ran through their tutorials for memory leaks and thread locks, understood them, and there doesn't appear to be any leak or bottleneck in our code. After running the application with a large dataset, we quickly see a 'sawblade' type shape on the memory analysis screen (see attached image), with PS Eden Space churning rapidly while PS Old Gen grows into a massive green block. This leads me to believe that the issue here is simply the sheer amount of space taken up by object collections rather than a leak holding onto unused memory.

[Image: JProbe trace of memory usage during parsing of a large dataset]

  • I am running on a 64-bit Windows 7 platform, but this will need to run in a 32-bit environment.
asked Dec 12 '11 by user407356



1 Answer

The approach I'd take would be to make two passes over the files, using SAX in both cases.

The first pass would parse the 'cross-reference' data needed in the calculations into custom objects and store them in Maps. If the 'cross-reference' data is large, then look at using a distributed cache (Coherence is the natural fit if you've started with Maps).

The second pass would parse the files, retrieve the 'cross-reference' data to perform calculations as needed, and then write the output XML using the javax.xml.stream APIs.
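A minimal sketch of this two-pass shape, with hypothetical element and attribute names. To keep the example short, the second pass here streams the collected data directly rather than re-parsing the files and running calculations:

```java
import java.io.File;
import java.io.FileWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public final class TwoPassConverter {

    // Pass 1: harvest only the cross-reference fields into a Map keyed by
    // id, instead of holding whole documents in memory.
    static Map<String, String> buildCrossRefs(File[] files) throws Exception {
        final Map<String, String> refs = new HashMap<String, String>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if ("record".equals(qName)) {
                    refs.put(atts.getValue("id"), atts.getValue("refValue"));
                }
            }
        };
        for (File f : files) {
            parser.parse(f, handler);
        }
        return refs;
    }

    // Pass 2: stream the output document with javax.xml.stream, so only
    // the element currently being written is held in memory.
    static void writeOutput(File out, Map<String, String> refs) throws Exception {
        XMLStreamWriter w = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new FileWriter(out));
        w.writeStartDocument();
        w.writeStartElement("commonFormat");
        for (Map.Entry<String, String> e : refs.entrySet()) {
            w.writeStartElement("record");
            w.writeAttribute("id", e.getKey());
            w.writeCharacters(e.getValue()); // calculated value would go here
            w.writeEndElement();
        }
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
    }
}
```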

answered Sep 28 '22 by Nick Holt