
Loading local chunks in DOM while parsing a large XML file in SAX (Java)

Tags: java, dom, xml, xpath, sax

I have an XML file that I would like to avoid loading entirely into memory. As everyone knows, for such a file it is better to use a SAX parser (which walks through the file and fires events when something relevant is found).

My current problem is that I would like to process the file "by chunk", which means:

  1. Parse the file and find a relevant tag (node)
  2. Load this tag entirely in memory (as we would do with DOM)
  3. Process this entity (the local chunk)
  4. When I'm done with the chunk, release it and go back to 1 (until the end of the file)

In a perfect world, I'm looking for something like this:

// 1. Create a parser and set the file to load
IdealParser p = new IdealParser("BigFile.xml");
// 2. Set an XPath to define the interesting nodes
p.setRelevantNodesPath("/path/to/relevant/nodes");
// 3. Add a handler to call back the right method once a node is found
p.setHandler(new Handler() {
    // 4. The method called back by the parser when a relevant node is found
    void aNodeIsFound(saxNode aNode) {
        // 5. Inflate the current node, i.e. load it (and all its content) in memory
        DomNode d = aNode.expand();
        // 6. Do something with the inflated node (method to be defined somewhere)
        doThingWithNode(d);
    }
});
// 7. Start the parser
p.start();

I'm currently stuck on how to expand a "sax node" (if you see what I mean…) efficiently.

Is there any Java framework or library suited to this kind of task?

asked Nov 03 '11 16:11 by Flavien Volken




1 Answer

UPDATE

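You could also just use the javax.xml.xpath APIs. Keep in mind that, at least with the JDK's default implementation, evaluating an expression against an InputSource parses the whole document into memory first, so this is only an option if the file fits: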

package forum7998733;

import java.io.FileReader;
import javax.xml.xpath.*;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathDemo {

    public static void main(String[] args) throws Exception {
        XPathFactory xpf = XPathFactory.newInstance();
        XPath xpath = xpf.newXPath();
        InputSource xml = new InputSource(new FileReader("BigFile.xml"));
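        // XPathConstants.NODE returns only the first matching node;
        // use XPathConstants.NODESET to get every match
        Node result = (Node) xpath.evaluate("/path/to/relevant/nodes", xml, XPathConstants.NODE);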
        System.out.println(result);
    }

}
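The same API also accepts a DOM node as its context instead of an InputSource, so it can be pointed at a small fragment that is already in memory. Here is a minimal sketch of that idea. XPathOnFragment and select are names I made up for illustration, and the sample XML (including the line elements and amount attributes) is invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: runs an XPath expression against a node that is already in memory.
class XPathOnFragment {

    // Returns the node value of every match (suited to attribute or text nodes).
    static List<String> select(Node context, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, context, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // A small fragment standing in for one chunk of the big file (invented sample data)
        String xml = "<statement account=\"123\"><line amount=\"10\"/><line amount=\"32\"/></statement>";
        Document fragment = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        System.out.println(select(fragment, "/statement/line/@amount"));
    }
}
```

Only the fragment has to fit in memory here, not the file it came from.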

Below is a sample of how it could be done with StAX.

input.xml

Below is some sample XML:

<statements>
   <statement account="123">
      ...stuff...
   </statement>
   <statement account="456">
      ...stuff...
   </statement>
</statements>

Demo

In this example, a StAX XMLStreamReader is used to find the node that will be converted to a DOM. Here each statement fragment is converted to a DOM, but your navigation algorithm could be more advanced.

package forum7998733;

import java.io.FileReader;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.*;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("src/forum7998733/input.xml"));
        xsr.nextTag(); // Advance to statements element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
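        // Loop while the next tag is a start tag, i.e. another <statement>;
        // the loop ends when </statements> is reached
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            // Copy the current element (and its whole subtree) into a new DOM
            DOMResult domResult = new DOMResult();
            t.transform(new StAXSource(xsr), domResult);

            // Process the chunk; here it is simply written to System.out
            DOMSource domSource = new DOMSource(domResult.getNode());
            StreamResult streamResult = new StreamResult(System.out);
            t.transform(domSource, streamResult);
        }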
    }

}
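If you want something closer to the callback style from the question, the loop above could be wrapped roughly as follows. This is only a sketch: ChunkedDomReader and forEachElement are hypothetical names, not part of any library, and elements are selected by local name rather than by XPath. It also checks the reader's current event after each transform instead of calling nextTag(), on the assumption (true of the JDK's built-in transformer as far as I can tell) that the reader may already sit on the next start tag when sibling elements are not separated by whitespace.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: streams the input and hands each matching element to a callback as a DOM.
class ChunkedDomReader {

    static void forEachElement(Reader in, String localName, Consumer<Element> handler)
            throws XMLStreamException, TransformerException {
        XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(in);
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        try {
            while (xsr.hasNext()) {
                if (xsr.isStartElement() && localName.equals(xsr.getLocalName())) {
                    // Copy this element and its subtree into a fresh DOM document
                    DOMResult chunk = new DOMResult();
                    identity.transform(new StAXSource(xsr), chunk);
                    handler.accept(((Document) chunk.getNode()).getDocumentElement());
                    // Do not advance here: the transform has already moved the reader,
                    // so the current event is re-checked on the next iteration
                } else {
                    xsr.next();
                }
            }
        } finally {
            xsr.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Compact sample input with no whitespace between the sibling elements
        String xml = "<statements><statement account=\"123\">a</statement>"
                + "<statement account=\"456\">b</statement></statements>";
        forEachElement(new StringReader(xml), "statement",
                e -> System.out.println(e.getAttribute("account")));
    }
}
```

Each DOM becomes garbage as soon as the callback returns and nothing else holds on to it, so memory use is bounded by the largest chunk. A matching element nested inside another matching element is delivered as part of the outer chunk, not separately.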

Output

<?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="123">
      ...stuff...
   </statement><?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="456">
      ...stuff...
   </statement>
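
Each chunk above is serialized as a document of its own, which is why the XML declaration is repeated. If you only want the bare fragment, the standard OutputKeys.OMIT_XML_DECLARATION output property can be set on the transformer. A minimal sketch, where FragmentSerializer and toXml are hypothetical names:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: serializes a DOM node without the leading <?xml ...?> declaration.
class FragmentSerializer {

    static String toXml(Node node) throws TransformerException {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Suppress the XML declaration that is otherwise written for each result
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for one chunk produced by the StAX loop
        Document chunk = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<statement account=\"123\">stuff</statement>")));
        System.out.println(toXml(chunk));
    }
}
```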
answered Sep 19 '22 01:09 by bdoughan