
Algorithmic complexity of XML parsers/validators

I need to know how the performance of different XML tools (parsers, validators, XPath expression evaluators, etc) is affected by the size and complexity of the input document. Are there resources out there that document how CPU time and memory usage are affected by... well, what? Document size in bytes? Number of nodes? And is the relationship linear, polynomial, or worse?

Update

In an article in IEEE Computer Magazine, vol. 41, no. 9 (Sept. 2008), the authors survey four popular XML parsing models (DOM, SAX, StAX and VTD). They run some very basic performance tests, which show that a DOM parser's throughput is halved when the input file's size grows from 1-15 KB to 1-15 MB, i.e. by a factor of about 1000. The throughput of the other models is not significantly affected.

Unfortunately, they did not perform more detailed studies, such as measuring throughput or memory usage as a function of the number of nodes or the document size.

The article is here.

Update

I was unable to find any formal treatment of this problem. For what it's worth, I have run some experiments measuring the number of nodes in an XML document as a function of the document's size in bytes. I'm working on a warehouse management system, and the XML documents are typical warehouse documents, e.g. advanced shipping notices.

The graph below shows the relationship between the size in bytes and the number of nodes (which should be proportional to the document's memory footprint under a DOM model). The different colors correspond to different kinds of documents. Both axes are logarithmic. The black line is the best fit to the blue points. It's interesting to note that for all kinds of documents, the relationship between byte size and number of nodes is linear, but that the coefficient of proportionality varies widely from one kind of document to another.

[Figure: benchmarks-bytes_vs_nodes, number of nodes vs. document size in bytes, log/log scale (source: flickr.com)]
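
For anyone who wants to repeat this kind of measurement, here is a minimal Python sketch. It is not the code used for the graph above. The names `make_doc`, `count_nodes` and `loglog_fit` are hypothetical, the synthetic document only loosely imitates a shipping notice, and "node" is assumed to mean an element, an attribute, or a non-whitespace text run. On a log/log scale, a proportional relationship shows up as a fitted slope close to 1, with the coefficient of proportionality hidden in the intercept.

```python
import math
import xml.etree.ElementTree as ET


def make_doc(lines: int) -> bytes:
    """Build a synthetic document (assumption: stands in for a real warehouse document)."""
    body = b'<line sku="A1"><qty>3</qty></line>' * lines
    return b"<asn>" + body + b"</asn>"


def count_nodes(xml_bytes: bytes) -> int:
    """Count elements, attributes and non-whitespace text runs (assumed definition of 'node')."""
    root = ET.fromstring(xml_bytes)
    nodes = 0
    for elem in root.iter():
        nodes += 1                  # the element itself
        nodes += len(elem.attrib)   # one node per attribute
        if elem.text and elem.text.strip():
            nodes += 1              # text before the first child
        if elem.tail and elem.tail.strip():
            nodes += 1              # text after the element's end tag
    return nodes


def loglog_fit(sizes, counts):
    """Least-squares fit of log(count) = slope * log(size) + intercept."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    docs = [make_doc(n) for n in (100, 1_000, 10_000)]
    sizes = [len(d) for d in docs]
    counts = [count_nodes(d) for d in docs]
    slope, intercept = loglog_fit(sizes, counts)
    print(f"slope={slope:.3f}, nodes per byte ~ {math.exp(intercept):.4f}")
```

With real documents you would replace `make_doc` with reading files from disk and run one fit per document type, since each type may have its own nodes-per-byte coefficient.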

lindelof asked Aug 28 '08 08:08




2 Answers

If I were faced with that problem and couldn't find anything on Google, I would probably try to work it out myself.

Some back-of-the-envelope estimates would give a feel for where it is going, but that would require having some idea of how an XML parser is implemented. For non-algorithmic benchmarks, take a look here:

  • http://www.xml.com/pub/a/Benchmark/exec.html
  • http://www.devx.com/xml/Article/16922
  • http://xerces.apache.org/xerces2-j/faq-performance.html

svrist answered Sep 19 '22 18:09


I think there are too many variables involved to come up with a simple complexity metric unless you make a lot of assumptions.

A simple SAX-style parser should be linear in time with respect to document size, and roughly constant in memory.
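
To illustrate why, here is a minimal sketch using Python's standard `xml.sax` module. The `ElementCounter` handler and the `count_elements` helper are hypothetical names. The handler is called once per start and end tag and keeps only three integers, so its own state does not grow with the number of elements.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Tracks a running element count and nesting depth; no tree is built."""

    def __init__(self):
        super().__init__()
        self.elements = 0
        self.depth = 0
        self.max_depth = 0

    def startElement(self, name, attrs):
        # Called once per start tag: constant work per element.
        self.elements += 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def endElement(self, name):
        self.depth -= 1


def count_elements(xml_bytes: bytes) -> ElementCounter:
    handler = ElementCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler


if __name__ == "__main__":
    result = count_elements(b"<a><b><c/></b><b/></a>")
    print(result.elements, result.max_depth)
```

Note that `parseString` still needs the whole input in memory; for large files, `xml.sax.parse` accepts a file name or stream, so the input can be read incrementally.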

Something like XPath would be impossible to describe in terms of just the input document since the complexity of the XPath expression plays a huge role.
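
As a small illustration, here is a sketch using the limited XPath subset in Python's `xml.etree.ElementTree`; the document and the expressions are made up for the example. The same input is queried three ways. The first expression only has to look at the root's direct children, while the other two have to consider every descendant, so the work depends on the expression as well as on the document.

```python
import xml.etree.ElementTree as ET

# Made-up document for illustration only.
doc = ET.fromstring(
    "<order>"
    "<item id='1'><item id='1a'/><item id='1b'/></item>"
    "<item id='2'/>"
    "<note/>"
    "</order>"
)

# Child axis: only the root's direct children are candidates.
direct = doc.findall("./item")

# Descendant axis: every element below the root is a candidate.
anywhere = doc.findall(".//item")

# Descendant axis plus a predicate evaluated on each candidate.
filtered = doc.findall(".//item[@id='1b']")

if __name__ == "__main__":
    print(len(direct), len(anywhere), len(filtered))
```

A full XPath engine supports far more than this (reverse axes, functions, nested predicates), which widens the gap between cheap and expensive expressions further.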

Likewise for schema validation, a large but simple schema may well be linear, whereas a smaller schema that has a much more complex structure would show worse runtime performance.

As with most performance questions, the only way to get accurate answers is to measure it and see what happens!
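
A minimal sketch of such a measurement, using only Python's standard library (`xml.dom.minidom` for DOM, `xml.sax` for SAX). The helper names `make_doc`, `best_of` and `benchmark` are hypothetical, and the synthetic document is an assumption, not real data. The absolute numbers will depend on the machine and the parser, so only the trend across sizes is meaningful.

```python
import time
import xml.sax
from xml.dom import minidom


def make_doc(lines: int) -> bytes:
    """Synthetic input (assumption): a flat list of small records."""
    body = b'<line sku="A1"><qty>3</qty></line>' * lines
    return b"<asn>" + body + b"</asn>"


def best_of(fn, repeats: int = 3) -> float:
    """Return the fastest of several runs to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(line_counts):
    """Time DOM and SAX parsing of the same document at several sizes."""
    rows = []
    for n in line_counts:
        doc = make_doc(n)
        dom_s = best_of(lambda: minidom.parseString(doc))
        sax_s = best_of(lambda: xml.sax.parseString(doc, xml.sax.ContentHandler()))
        rows.append({"lines": n, "bytes": len(doc), "dom_s": dom_s, "sax_s": sax_s})
    return rows


if __name__ == "__main__":
    for row in benchmark((1_000, 10_000)):
        mb = row["bytes"] / 1e6
        print(f'{row["bytes"]:>9} bytes  '
              f'DOM {mb / row["dom_s"]:.1f} MB/s  SAX {mb / row["sax_s"]:.1f} MB/s')
```

If throughput in MB/s stays roughly flat as the size grows, the parser is behaving linearly on that kind of input; a falling throughput suggests something worse than linear, or memory pressure.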

Rob Walker answered Sep 23 '22 18:09