Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reading Huge XML File using StAX and XPath

Tags:

java

xml

xpath

stax

The input file contains thousands of transactions in XML format which is around 10GB of size. The requirement is to pick each transaction XML based on the user input and send it to processing system.

The sample content of the file

<transactions>
    <txn id="1">
      <name> product 1</name>
      <price>29.99</price>
    </txn>

    <txn id="2">
      <name> product 2</name>
      <price>59.59</price>
    </txn>
</transactions>

The (technical)user is expected to give the input tag name like <txn>.

We would like to provide this solution to be more generic. The file content might be different and users can give a XPath expression like "//transactions/txn" to pick individual transactions.

There are few technical things we have to consider here

  • The file can be in a shared location or FTP
  • Since the file size is huge, we can't load the entire file in JVM

Can we use StAX parser for this scenario? It has to take XPath expression as a input and pick/select transaction XML.

Looking for suggestions. Thanks in advance.

like image 231
Sivasubramaniam Arunachalam Avatar asked Aug 27 '11 16:08

Sivasubramaniam Arunachalam


2 Answers

Stax and xpath are very different things. Stax allows you to parse a streaming XML document in a forward direction only. Xpath allows parsing in both directions. Stax is a very fast streaming XML parser, but, if you want xpath, java has a separate library for that.

Take a look at this question for a very similar discussion: Is there any XPath processor for SAX model?

like image 148
Jon7 Avatar answered Sep 27 '22 17:09

Jon7


If performance is an important factor, and/or the document size is large (both of which seem to be the case here), the difference between an event parser (like SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document prior to evaluating the XPath expression. [It's interesting to note that all Java Document Object Model implementations like the DOM or Axiom use an event processor (like SAX or StAX) to build the in-memory representation, so if you can ever get by with only the event processor you're saving both memory and the time it takes to build a DOM.]

As I mentioned, the XPath implementation in the JDK operates upon a W3C DOM Document. You can see this in the Java JDK source code implementation by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where prior to the evaluate() method being called the parser must first parse the source:

  Document document = getParser().parse( source );

After this your 10GB of XML will be represented in memory (plus whatever overhead) — probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a really strong justification for an XPath (except perhaps programming elegance). The same would be true for the XProc suggestion: this would also build a DOM. If you truly need a DOM you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API and builds its DOM over StAX, so it's fast, and uses Jaxen for its XPath implementation. Jaxen requires some kind of DOM (W3C DOM, DOM4J, or JDOM). This will be true of all XPath implementations, so if you don't truly need XPath sticking with just the events parser would be recommended.

SAX is the old streaming API, with StAX newer, and a great deal faster. Either using the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating a XML event filter that first matches on element type name (to capture your <txn> elements). This will create small bursts of events (element, attribute, text) that can be checked for your matching user values. Upon a suitable match you could either pull the necessary information from the events or pipe the bounded events to build a mini-DOM from them if you found the result was easier to navigate. But it sounds like that might be overkill if the markup is simple.

This would likely be the simplest, fastest possible approach and avoid the memory overhead of building a DOM. If you passed the names of the element and attribute to the filter (so that your matching algorithm is configurable) you could make it relatively generic.

like image 30
Ichiro Furusato Avatar answered Sep 27 '22 18:09

Ichiro Furusato