Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference in performance between Stax and DOM parsing

I have been using DOM for a long time and as such DOM parsing performance wise has been pretty good. Even when dealing with XML of about 4-7 MB the parsing has been fast. The issue we face with DOM is the memory footprint which become huge as soon as we start dealing with large XMLs.

Lately I tried moving to Stax (Streaming parsers for XML) which are supposed top be second generation parsers (reading about Stax it said its the fastest parser now). When I tried Stax parser for large XML for about 4MB memory footprint definitely reduced drastically but time take to parse entire XML and create java object out of it increased almost by 5 times over DOM.

I used sjsxp.jar implementation of Stax.

I can deduce to some extent logically that performance may not be extremely good due to streaming nature of the parser but a reduction of 5 time (e.g. DOM takes about 8 seconds to build object for this XML, whereas Stax parsing took about 40 seconds on average) is definitely not going to be acceptable.

Am I missing some point here completely as I am not able to come to terms with these performance numbers

like image 436
Fazal Avatar asked Feb 27 '23 07:02

Fazal


2 Answers

package parsers;

/**
 *
 * @author Arthur Kushman
 */

import java.io.File;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Element;


public class DOMTest {

  public static void main(String[] args) {
  long time1 = System.currentTimeMillis();
   try {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse(new File("/Users/macpro/Desktop/myxml.xml"));
    doc.getDocumentElement().normalize();
    // System.out.println("Root Element: "+doc.getDocumentElement().getNodeName());
    NodeList nodeList = doc.getElementsByTagName("input");
    // System.out.println("Information of all elements in input");

    for (int s=0;s<nodeList.getLength();s++) {
      Node firstNode = nodeList.item(s);
      if (firstNode.getNodeType() == Node.ELEMENT_NODE) {
        Element firstElement = (Element)firstNode;
        NodeList firstNameElementList = firstElement.getElementsByTagName("href");
        Element firstNameElement = (Element)firstNameElementList.item(0);
        NodeList firstName = firstNameElement.getChildNodes();
        System.out.println("First Name: "+((Node)firstName.item(s)).getNodeValue());        
      }
    }


   } catch (Exception ex) {
    System.out.println(ex.getMessage());
    System.exit(1);
   }
  long time2 = System.currentTimeMillis() - time1;
  System.out.println(time2);
  }

}

45 mills

package parsers;

/**
 *
 * @author Arthur Kushman
 */
import javax.xml.stream.*;
import java.io.*;
import javax.xml.namespace.QName;

public class StAXTest {

  public static void main(String[] args) throws Exception {
  long time1 = System.currentTimeMillis();
    XMLInputFactory factory = XMLInputFactory.newInstance();
    // factory.setXMLReporter(myXMLReporter);
    XMLStreamReader reader = factory.createXMLStreamReader(
            new FileInputStream(
            new File("/Users/macpro/Desktop/myxml.xml")));

    /*String encoding = reader.getEncoding();

    System.out.println("Encoding: "+encoding);

    while (reader.hasNext()) {
      int event = reader.next();
      if (event == XMLStreamConstants.START_ELEMENT) {
        QName element = reader.getName();
        // String text = reader.getText();
        System.out.println("Element: "+element);
        // while (event != XMLStreamConstants.END_ELEMENT) {
          System.out.println("Text: "+reader.getLocalName());
        // }
      }
    }*/

  try {
    int inElement = 0;
    for (int event = reader.next();event != XMLStreamConstants.END_DOCUMENT;
    event = reader.next()) {
      switch (event) {
        case XMLStreamConstants.START_ELEMENT:
          if (isElement(reader.getLocalName(), "href")) {
            inElement++;
          }
          break;
        case XMLStreamConstants.END_ELEMENT:
          if (isElement(reader.getLocalName(), "href")) {
            inElement--;
            if (inElement == 0) System.out.println();
          }
          break;
        case XMLStreamConstants.CHARACTERS:
          if (inElement>0) System.out.println(reader.getText());
          break;
        case XMLStreamConstants.CDATA:
          if (inElement>0)  System.out.println(reader.getText());
          break;
      }
    }
    reader.close();
  } catch (XMLStreamException ex) {
    System.out.println(ex.getMessage());
    System.exit(1);
  }
    // System.out.println(System.currentTimeMillis());
    long time2 = System.currentTimeMillis() - time1;
    System.out.println(time2);
 }

  public static boolean isElement(String name, String element) {
    if (name.equals(element)) return true;
    return false;
  }

}

23 mills

StAX wins =)

like image 128
Arthur Kushman Avatar answered Mar 05 '23 17:03

Arthur Kushman


Although question lacks some details, I am pretty sure that the answer is that it's not parsing that is slow in either case (DOM is not parser; DOM trees are typically built using SAX or Stax parsers), but code above it that creates objects.

There are efficient automatic data binders, including JAXB (and with proper settings, XStream), which could help. They are faster than DOM, because the main performance problem with DOM (and JDOM, Dom4j and XOM) is that tree models are inherently expensive compared to POJOs -- they are basically glorified HashMaps, with lots of pointers for convenient untyped traversal; especially regarding memory usage.

As to parsers, Woodstox is faster Stax parser that Sjsxp; and Aalto is even faster if raw speed is of essence. But I doubt main issue is parser speed here.

like image 33
StaxMan Avatar answered Mar 05 '23 18:03

StaxMan