
Parsing very large XML files and marshalling to Java Objects

Tags: java, xml

I have the following issue: I have very large XML files (300+ MB) with a very complex structure, and I need to parse them in order to add some of their values to the database. I want to use a StAX parser, since pull-parsing lets me process only part of the XML file at a time instead of loading the whole thing into memory. On the other hand, getting the values out with StAX (at least for these XML files) is cumbersome and takes a ton of code. From that point of view it would help me immensely if I could unmarshal the XML file to Java objects (as JAXB does); however, this would load the whole file plus a ton of object instances into memory all at once.

My question is: is there some way to pull-parse (or just partially parse) the file sequentially and unmarshal only those parts to Java objects, so I can deal with them easily without running out of memory?

Shivan Dragon asked Oct 12 '11 21:10


3 Answers

I would recommend Eclipse EMF, although it has the same problem: if you give it the file name, it parses the whole thing. There are some options to reduce how much is loaded, but I didn't look into them much, as we run on machines with 96 GB RAM. :)

Anyway, if your XML format is well defined, one workaround is to fool EMF by breaking the whole file down into several smaller (but still well-formed) XML snippets and feeding them to it one after the other. I don't know JAXB, but the same workaround can perhaps be applied there as well, which is what I would recommend, because EMF is too big a hammer for such a small issue.

Just to elaborate a bit, if your XML looks like this:

<tag1>
    <tag2>
        <tag3/>
        <tag4>
            <tag5/>
        </tag4>
        <tag6/>
        <tag7/>
    </tag2>

    <tag2>
        <tag3/>
        <tag4>
            <tag5/>
        </tag4>
        <tag6/>
        <tag7/>
    </tag2>
............
    <tag2>
        <tag3/>
        <tag4>
            <tag5/>
        </tag4>
        <tag6/>
        <tag7/>
    </tag2>
</tag1>

Then it can be broken down into separate XML documents, each starting with <tag2> and ending with </tag2>. Most Java parsers accept a stream, so scan the file with whatever you want, wrap each <tag2> fragment in a StringReader or ByteArrayInputStream (or similar) in a loop, and pass it to JAXB or EMF.
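A minimal sketch of that splitting step using only JDK classes, assuming the repeated element really is named tag2 as in the sample above. The class name Tag2Splitter and the callback parameter are made up for illustration. A StAX reader walks the file, and an identity Transformer copies each <tag2> subtree into its own small string, which could then be wrapped in a StringReader and handed to JAXB or EMF.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

final class Tag2Splitter {

    /** Passes each <tag2>...</tag2> subtree to the sink as a standalone XML string. */
    static void split(Reader xml, Consumer<String> sink) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer copy = TransformerFactory.newInstance().newTransformer();
            copy.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            while (reader.hasNext()) {
                if (reader.isStartElement() && "tag2".equals(reader.getLocalName())) {
                    StringWriter out = new StringWriter();
                    // Consumes the reader up to the end of this <tag2> subtree.
                    copy.transform(new StAXSource(reader), new StreamResult(out));
                    sink.accept(out.toString());
                    // Do not call next() here: the reader may already be sitting
                    // on the following <tag2>, so re-check the current event first.
                } else {
                    reader.next();
                }
            }
            reader.close();
        } catch (XMLStreamException | TransformerException e) {
            throw new IllegalStateException("Could not split the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tag1><tag2><tag3/></tag2><tag2><tag6/></tag2></tag1>";
        split(new StringReader(xml), snippet -> System.out.println(snippet));
    }
}
```

Only one snippet is in memory at a time, as long as the callback does not hold on to them. For a real file, pass a FileReader or similar instead of the StringReader.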

HTH

Kashyap answered Oct 30 '22 03:10


Well, first off I want to thank the two people who answered my question. In the end I did not use their suggestions, partly because the proposed technologies are a bit far from, let's say, "standard" Java XML parsing, and it feels weird to go that far when a similar tool already ships with Java, and partly because I did find a solution that uses only Java APIs.

I will not go into too much detail about the solution I found, because I've already finished the implementation and it's quite a big chunk of code to post here (I use Spring Batch on top of it all, with a ton of configuration and stuff).

I will, however, briefly describe what I ended up doing:

The big idea is that if you have an XML document and its corresponding XSD schema, you can parse and unmarshal it with JAXB, and you can do so in chunks: the chunks are read with an event parser such as StAX and then passed to the JAXB Unmarshaller.

In practice this means you must first decide where in your XML file you can say "this part here has a lot of repetitive structure; I will treat those repetitions one at a time". The repetitive parts are usually the same child tag repeated many times inside a parent tag. So all you have to do is have your StAX parsing loop react to the start of each of those child tags, then stream the content of that child tag over to JAXB, unmarshal it, and process it.
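A rough sketch of that loop, not the actual implementation. JAXB has not shipped with the JDK since Java 11, so to stay runnable with the JDK alone a hand-written bind method stands in for it here. The element names item and name, the id attribute, and the classes ChunkedReader and Item are all hypothetical. With JAXB on the classpath, the body of bind would shrink to a call such as unmarshaller.unmarshal(reader, Item.class).getValue(), which also leaves the reader just past the chunk's end tag.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class ChunkedReader {

    /** Hypothetical value object for one repeated <item> element. */
    static final class Item {
        final String id;
        final String name;
        Item(String id, String name) { this.id = id; this.name = name; }
    }

    /**
     * Stand-in for JAXB: binds the <item> the reader is currently on and
     * leaves the reader on the event right after the matching </item>.
     */
    static Item bind(XMLStreamReader r) throws XMLStreamException {
        String id = r.getAttributeValue(null, "id");
        String name = null;
        // Walk the subtree until the matching </item> is reached.
        while (r.next() != XMLStreamConstants.END_ELEMENT || !"item".equals(r.getLocalName())) {
            if (r.isStartElement() && "name".equals(r.getLocalName())) {
                name = r.getElementText(); // leaves the reader on </name>
            }
        }
        r.next(); // step past </item>
        return new Item(id, name);
    }

    /** Streams through the document and hands over one Item at a time. */
    static void forEachItem(Reader xml, Consumer<Item> handler) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (r.hasNext()) {
                if (r.isStartElement() && "item".equals(r.getLocalName())) {
                    // bind() has already moved past this chunk, so no next() here.
                    handler.accept(bind(r));
                } else {
                    r.next();
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<items><item id=\"1\"><name>a</name></item>"
                   + "<item id=\"2\"><name>b</name></item></items>";
        forEachItem(new StringReader(xml), it -> System.out.println(it.id + " " + it.name));
    }
}
```

The point of the design is that only the object for the current chunk exists at any time, so memory use depends on the size of one repeated element rather than on the size of the file.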

Really, the idea is excellently described in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6, which was pretty new at the time, so version-wise it's not that old at all):

http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/

Shivan Dragon answered Oct 30 '22 04:10


Document projection might be the answer here. Saxon and a number of other XQuery processors offer this as an option. If you have a reasonably simple query that selects a small amount of data from a large document, the query processor analyses the query to work out which parts of the tree need to be available for the query, and which can be discarded during processing. The resulting tree can often be only 1% of the size of the full document. Details for Saxon here:

http://saxonica.com/documentation/sourcedocs/projection.xml

Michael Kay answered Oct 30 '22 03:10