 

Parallel XML Parsing in Java

I'm writing an application which processes a lot of XML files (>1000) with deep node structures. It takes about six seconds with Woodstox (Event API) to parse a file with 22,000 nodes.

The algorithm is part of a process with user interaction, where only a few seconds of response time are acceptable. So I need to improve my strategy for handling the XML files.

  1. My process analyses the XML files (extracts only a few nodes).
  2. The extracted nodes are processed and the new result is written into a new data stream (resulting in a copy of the document with modified nodes).

Now I'm thinking about a multithreaded solution (which scales better on hardware with 16+ cores). I thought about the following strategies:

  1. Creating multiple parsers and running them in parallel on the XML sources.
  2. Rewriting my parsing algorithm to be thread-safe, so that only one instance of the parser (factories, ...) is used.
  3. Splitting the XML source into chunks and assigning the chunks to multiple processing threads (map-reduce XML - serial)
  4. Optimizing my algorithm (a better StAX parser than Woodstox?) / using a parser with built-in concurrency

I want to improve both the overall performance and the "per file" performance.

Do you have experience with such problems? What is the best way to go?

Martin K., asked Nov 17 '10 19:11



3 Answers

  1. This one is obvious: just create several parsers and run them in parallel in multiple threads.

  2. Take a look at Woodstox Performance (down at the moment; try the Google cache).

  3. This can be done IF the structure of your XML is predictable, i.e. if it has a lot of identical top-level elements. For instance:

    <element>
        <more>more elements</more>
    </element> 
    <element>
        <other>other elements</other>
    </element>
    

    In this case you could create a simple splitter that searches for <element> and feeds each such part to a separate parser instance. That's a simplified approach: in real life I'd go with RandomAccessFile to find the start and stop points (<element>) and then create a custom FileInputStream that operates on just that part of the file.

  4. Take a look at Aalto. It comes from the same guys that created Woodstox. These are experts in this area - don't reinvent the wheel.
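The splitting idea from point 3 could be sketched roughly as follows. This is an in-memory simplification rather than the RandomAccessFile variant: it assumes the repeated tag is literally `<element>` with no attributes, that it never nests inside itself, and that the tag text never shows up inside comments or CDATA. The class and method names are hypothetical, and each worker thread keeps its own factory on the assumption that the StAX implementation in use may not be thread-safe.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

final class ChunkSplitSketch {
    // Hypothetical tag name; assumed attribute-free and never nested in itself.
    private static final String OPEN = "<element>";
    private static final String CLOSE = "</element>";

    // One factory per worker thread (assumption: the factory may not be thread-safe).
    private static final ThreadLocal<XMLInputFactory> FACTORY =
            ThreadLocal.withInitial(XMLInputFactory::newInstance);

    // Cut the text into stand-alone <element>...</element> fragments by plain string search.
    static List<String> split(String xml) {
        List<String> chunks = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(OPEN, from);
            if (start < 0) break;
            int end = xml.indexOf(CLOSE, start);
            if (end < 0) break; // unbalanced input: stop rather than guess
            end += CLOSE.length();
            chunks.add(xml.substring(start, end));
            from = end;
        }
        return chunks;
    }

    // Parse one fragment with its own reader and count its start tags.
    static int countStartElements(String chunk) throws Exception {
        XMLStreamReader reader = FACTORY.get().createXMLStreamReader(new StringReader(chunk));
        try {
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) count++;
            }
            return count;
        } finally {
            reader.close();
        }
    }

    // Parse all fragments on a thread pool; results come back in document order.
    static List<Integer> parseChunksInParallel(String xml, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String chunk : split(xml)) {
                Callable<Integer> task = () -> countStartElements(chunk);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : futures) counts.add(f.get());
            return counts;
        } finally {
            pool.shutdown();
        }
    }
}
```

Counting start tags only stands in for the real per-chunk work; whether splitting pays off depends on how much work each fragment needs compared with the cost of scanning for the boundaries.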

Peter Knego, answered Oct 20 '22 19:10


I agree with Jim. I think that if you want to improve the overall performance of processing 1000 files, your plan is good, except for #3, which is irrelevant in this case. If, however, you want to improve the performance of parsing a single file, you have a problem. I do not know how it is possible to split an XML file without parsing it. Each chunk on its own will not be well-formed XML, and your parser will fail.

I believe that improving the overall time is good enough for you. In this case read this tutorial: http://download.oracle.com/javase/tutorial/essential/concurrency/index.html then create a thread pool of, for example, 100 threads and a queue that contains the XML sources. Each thread will then parse only 10 files, which will bring a serious performance benefit in a multi-CPU environment.
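A minimal sketch of that setup, using the standard `ExecutorService` (whose internal work queue plays the role of the queue of XML sources) and the JDK's bundled StAX parser. The class and method names are hypothetical, the pool size is left as a parameter, and counting start tags is only a placeholder for the real per-file processing. Each worker thread keeps its own factory, on the assumption that the factory may not be thread-safe.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

final class FilePoolSketch {
    // One factory per worker thread (assumption: the factory may not be thread-safe).
    private static final ThreadLocal<XMLInputFactory> FACTORY =
            ThreadLocal.withInitial(XMLInputFactory::newInstance);

    // Placeholder for the real work: stream through one file and count start tags.
    static int countElements(Path file) throws Exception {
        try (InputStream in = Files.newInputStream(file)) {
            XMLStreamReader reader = FACTORY.get().createXMLStreamReader(in);
            try {
                int count = 0;
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) count++;
                }
                return count;
            } finally {
                reader.close();
            }
        }
    }

    // Queue one task per file on a fixed-size pool and wait for all of them.
    static Map<Path, Integer> parseAll(List<Path> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (Path file : files) {
                tasks.add(() -> countElements(file));
            }
            // invokeAll returns the futures in the same order as the tasks.
            List<Future<Integer>> results = pool.invokeAll(tasks);
            Map<Path, Integer> counts = new LinkedHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                counts.put(files.get(i), results.get(i).get());
            }
            return counts;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `FilePoolSketch.parseAll(files, 100)` would match the numbers above; the best pool size for a given machine is something to measure rather than assume.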

AlexR, answered Oct 20 '22 18:10


In addition to the existing good suggestions, there is one rather simple thing to do: use the Cursor API (XMLStreamReader), NOT the Event API. The Event API adds 30-50% overhead without (just IMO) making processing significantly easier. In fact, if you want convenience, I would recommend using StaxMate instead; it builds on top of the Cursor API without adding significant overhead (at most 5-10% compared to hand-written code).
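For illustration, extracting a few nodes with the Cursor API might look like the sketch below. It uses only the standard `javax.xml.stream` interfaces, so it should run against the JDK's bundled parser as well as Woodstox. The class and method names are hypothetical, and it assumes the elements of interest contain text only, since `getElementText()` fails on nested child elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class CursorExtractSketch {
    // Collect the text content of every element with the given local name.
    // Assumption: those elements hold text only, no child elements.
    static List<String> extract(String xml, String localName) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            List<String> values = new ArrayList<>();
            while (reader.hasNext()) {
                // next() returns an int event code; no event objects are allocated.
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    // Reads up to the matching end tag and leaves the cursor on it.
                    values.add(reader.getElementText());
                }
            }
            return values;
        } finally {
            reader.close();
        }
    }
}
```

The factory is created inside the method only to keep the sketch short; in real code it would be created once and reused.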

Now: I assume you have done basic optimizations with Woodstox; but if not, check out "3 Simple Rules for Fast XML-processing using Stax". Specifically, you absolutely should:

  1. Make sure you only create XMLInputFactory and XMLOutputFactory instances once
  2. Close readers and writers to ensure buffer recycling (and other useful reuse) works as expected.

The reason I mention this is that while these make no functional difference (the code works as expected either way), they can make a big performance difference, although more so when processing smaller files.
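Both rules can be seen in a small copy-with-modification sketch like the one below, written against the standard `javax.xml.stream` API. The class name, the `target` parameter and the upper-casing "modification" are hypothetical placeholders. The copy is deliberately simplified: namespaces, comments and processing instructions are not carried over, and the shared static factories assume either single-threaded use or an implementation whose factories are thread-safe.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

final class CopyWithEditSketch {
    // Rule 1: the factories are created once and reused for every document.
    private static final XMLInputFactory IN = XMLInputFactory.newInstance();
    private static final XMLOutputFactory OUT = XMLOutputFactory.newInstance();

    // Copy the document, upper-casing text inside elements named `target`.
    // Assumption: `target` elements do not nest inside each other.
    static String copy(String xml, String target) throws XMLStreamException {
        StringWriter sink = new StringWriter();
        XMLStreamReader reader = IN.createXMLStreamReader(new StringReader(xml));
        XMLStreamWriter writer = OUT.createXMLStreamWriter(sink);
        try {
            boolean inTarget = false;
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        writer.writeStartElement(reader.getLocalName());
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            writer.writeAttribute(reader.getAttributeLocalName(i),
                                    reader.getAttributeValue(i));
                        }
                        if (target.equals(reader.getLocalName())) inTarget = true;
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        writer.writeCharacters(inTarget
                                ? reader.getText().toUpperCase(Locale.ROOT)
                                : reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (target.equals(reader.getLocalName())) inTarget = false;
                        writer.writeEndElement();
                        break;
                    default:
                        break; // other event types are skipped in this sketch
                }
            }
            writer.flush();
        } finally {
            // Rule 2: always close, so the implementation can recycle its buffers.
            writer.close();
            reader.close();
        }
        return sink.toString();
    }
}
```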

Running multiple instances does also make sense, although usually with at most 1 thread per core. However, you will only benefit as long as your storage I/O can keep up with such speeds; if the disk is the bottleneck, this will not help and can in some cases hurt (if disk seeks compete). But it is worth a try.

StaxMan, answered Oct 20 '22 19:10