Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reading a big XML file using stax and dom

Tags:

java

dom

xml

stax

I need to read several big (200Mb-500Mb) XML files, so I want to use StaX. My system has two modules - one to read the file ( with StaX ); another module ( 'parser' module ) suppose to get a single entry of that XML and parse it using DOM. My XML files don't have a certain structure - so I cannot use JaxB. How can I pass the 'parser' module a specific entry that I want it to parse? For example:

<Items>
   <Item>
        <name> .... </name>
        <price> ... </price>
   </Item>
   <Item>
        <name> .... </name>
        <price> ... </price>
   </Item>
</Items>

I want to use StaX to parse that file - but each 'item' entry will be passed to the 'parser' module.

Edit:
After a little more reading - I think I need a library that reads an XML file using stream - but parse each entry using DOM. Is there such a thing?

like image 966
Noam Avatar asked Feb 21 '12 15:02

Noam


People also ask

What is the best way to parse XML in Java?

DOM Parser is the easiest java xml parser to learn. DOM parser loads the XML file into memory and we can traverse it node by node to parse the XML. DOM Parser is good for small files but when file size increases it performs slow and consumes more memory.


2 Answers

You could use a StAX (javax.xml.stream) parser and transform (javax.xml.transform) each section to a DOM node (org.w3c.dom):

import java.io.*;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.dom.DOMResult;
import org.w3c.dom.*

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
        xsr.nextTag(); // Advance to statements element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            DOMResult result = new DOMResult();
            t.transform(new StAXSource(xsr), result);
            Node domNode = result.getNode();
        }
    }

}

Also see:

  • Split 1GB Xml file using Java
like image 165
bdoughan Avatar answered Nov 12 '22 12:11

bdoughan


Blaise Doughan's answer fails in clean java 7 and 8 due to https://bugs.openjdk.java.net/browse/JDK-8016914

java.lang.NullPointerException
at com.sun.org.apache.xerces.internal.dom.CoreDocumentImpl.setXmlVersion(CoreDocumentImpl.java:860)
at com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM.setDocumentInfo(SAX2DOM.java:144)

Funny thing: if you use jaxb unmarshaller, you don't get the NPE:

package com.common.config;

import java.io.*;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.*;

import org.w3c.dom.*;

public class Demo {


    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
        // Advance to root element
        xsr.nextTag(); // TODO: nextTag() can't skip DTD
        xsr.next(); // Advance to first item or EOD

        final JAXBContext jaxbContext = JAXBContext.newInstance();
        final Unmarshaller unm = jaxbContext.createUnmarshaller();
        while(true) {
            // previous unmarshal() already did advance to next element or whitespace
            if (xsr.getEventType() == XMLStreamReader.START_ELEMENT) {
                JAXBElement<Object> jel = unm.unmarshal(xsr, Object.class);
                Node domNode = (Node)jel.getValue();
                System.err.println(domNode.getNodeName());
            } else if (!xsr.hasNext()) {
                    break;
            } else {
                xsr.next();
            }
        }
    }

}

The reason is: com.sun.xml.internal.bind.v2.runtime.unmarshaller.StAXConnector$1 does not implement Locator2 therefore it has no getXMLVersion().

like image 2
basin Avatar answered Nov 12 '22 10:11

basin