I'm writing an application which processes a lot of XML files (>1000) with deep node structures. It takes about six seconds with Woodstox (Event API) to parse a file with 22,000 nodes.
The algorithm is used in a process with user interaction where a response time of only a few seconds is acceptable, so I need to improve my strategy for handling the XML files.
Now I'm thinking about a multithreaded solution (which scales better on 16+ core hardware). I thought about the following strategies:
I want to improve both the overall performance and the "per file" performance.
Do you have experience with such problems? What is the best way to go?
This one is obvious: just create several parsers and run them in parallel in multiple threads.
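A minimal sketch of that strategy, assuming a fixed pool with one thread per core; parseOneFile() is just a hypothetical placeholder for the per-file Woodstox code you already have:

import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelParsing {

    // One task per file; each task runs its own parser instance.
    public static void parseAll(List<Path> xmlFiles) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());   // one thread per core
        for (Path file : xmlFiles) {
            pool.submit(() -> parseOneFile(file));             // parser created inside the task
        }
        pool.shutdown();                                       // no more tasks will be added
        pool.awaitTermination(1, TimeUnit.HOURS);              // wait for all files to finish
    }

    static void parseOneFile(Path file) {
        // your existing Woodstox per-file parsing goes here
    }
}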
Take a look at Woodstox Performance (down at the moment, try the Google cache).
This can be done IF the structure of your XML is predictable: if it has a lot of identical top-level elements. For instance:
<element>
<more>more elements</more>
</element>
<element>
<other>other elements</other>
</element>
In this case you could create a simple splitter that searches for <element> and feeds each part to a particular parser instance. That's a simplified approach: in real life I'd go with RandomAccessFile to find the start/stop points (<element>) and then create a custom FileInputStream that operates on only a part of the file.
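A rough sketch of that splitter idea, under some loud assumptions: the file is in an ASCII-compatible encoding such as UTF-8, "<element" only ever appears as the top-level start tag (never inside CDATA, comments or attribute values), and the downstream parsers can cope with a fragment that has no single enclosing root. ElementSplitter and its method names are made up for illustration:

import java.io.File;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class ElementSplitter {

    // Scans the file once and records the byte offset of every "<element" occurrence.
    // (A real splitter would buffer these reads instead of going byte by byte.)
    static List<Long> findOffsets(File file) throws IOException {
        byte[] needle = "<element".getBytes(StandardCharsets.US_ASCII);
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            int match = 0;
            long length = raf.length();
            for (long pos = 0; pos < length; pos++) {
                int b = raf.read();
                if (b == needle[match]) {
                    match++;
                    if (match == needle.length) {          // full "<element" found
                        offsets.add(pos - needle.length + 1);
                        match = 0;
                    }
                } else {
                    match = (b == needle[0]) ? 1 : 0;      // restart the match
                }
            }
        }
        return offsets;
    }

    // An InputStream that exposes only the bytes in [start, end) of the file,
    // i.e. the "custom FileInputStream that operates on a part of the file".
    static InputStream slice(File file, long start, long end) throws IOException {
        FileChannel channel = FileChannel.open(file.toPath(), StandardOpenOption.READ);
        channel.position(start);                           // jump straight to the chunk
        InputStream in = Channels.newInputStream(channel);
        return new FilterInputStream(in) {
            long left = end - start;
            @Override public int read() throws IOException {
                if (left <= 0) return -1;
                int b = super.read();
                if (b >= 0) left--;
                return b;
            }
            @Override public int read(byte[] buf, int off, int len) throws IOException {
                if (left <= 0) return -1;
                int n = super.read(buf, off, (int) Math.min(len, left));
                if (n > 0) left -= n;
                return n;
            }
        };
    }
}

Each chunk then runs from offsets.get(i) to offsets.get(i + 1) (or to the file length for the last one), and every chunk can be handed to its own parser thread.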
Take a look at Aalto. It's from the same guys that created Woodstox. They are the experts in this area - don't reinvent the wheel.
I agree with Jim. I think that if you want to improve the performance of the overall processing of 1000 files, your plan is good, except for #3, which is irrelevant in this case. If, however, you want to improve the performance of parsing a single file, you have a problem. I do not know how it would be possible to split an XML file without parsing it. Each chunk would be invalid XML and your parser would fail.
I believe that improving the overall time is good enough for you. In this case read this tutorial: http://download.oracle.com/javase/tutorial/essential/concurrency/index.html then create a thread pool of, for example, 100 threads and a queue that contains the XML sources. Each thread will then parse only 10 files, which will bring a serious performance benefit in a multi-CPU environment.
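A sketch of that queue-of-sources setup; parseFile() is a hypothetical placeholder for the existing Woodstox code, and the pool size here defaults to one thread per core (the answer suggests 100, so adjust THREADS to taste):

import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueDrivenParsing {
    static final int THREADS = Runtime.getRuntime().availableProcessors();

    public static void parseAll(List<Path> files) throws InterruptedException {
        BlockingQueue<Path> queue = new LinkedBlockingQueue<>(files);
        Thread[] workers = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            workers[i] = new Thread(() -> {
                Path next;
                while ((next = queue.poll()) != null) {   // stop when the queue is drained
                    parseFile(next);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();                                // wait until every file is processed
        }
    }

    static void parseFile(Path file) {
        // your existing Woodstox / StAX parsing of a single file goes here
    }
}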
In addition to the existing good suggestions there is one rather simple thing to do: use the cursor API (XMLStreamReader), NOT the Event API. The Event API adds 30-50% overhead without (just IMO) making processing significantly easier. In fact, if you want convenience, I would recommend using StaxMate instead; it builds on top of the cursor API without adding significant overhead (at most 5-10% compared to hand-written code).
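For reference, a minimal cursor-API loop with a plain XMLStreamReader (this works with Woodstox or any other StAX implementation on the classpath); handleElement() is just a placeholder for your own per-element logic:

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class CursorParsing {

    public static void parse(String path) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();  // picks up Woodstox if present
        try (InputStream in = new FileInputStream(path)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                // next() just moves the cursor; no Event objects are allocated
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    handleElement(reader.getLocalName(), reader);
                }
            }
            reader.close();
        }
    }

    static void handleElement(String name, XMLStreamReader reader) {
        // your per-element processing goes here
    }
}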
Now: I assume you have done basic optimizations with Woodstox; but if not, check out "3 Simple Rules for Fast XML-processing using Stax". Specifically, you absolutely should:
The reason I mention this is that while these make no functional difference (the code works as expected), they can make a big performance difference; although more so when processing smaller files.
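As one example of the kind of setup cost such rules target (not necessarily the article's exact list): creating the XMLInputFactory involves a classpath lookup and is far more expensive than creating a reader, so it pays to create it once and reuse it across all 1000+ files instead of calling newInstance() per file:

import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class FactoryReuse {
    // Created once and reused; XMLInputFactory.newInstance() is comparatively expensive.
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    static XMLStreamReader readerFor(InputStream in) throws Exception {
        return FACTORY.createXMLStreamReader(in);   // cheap, done once per file
    }
}

(If you share one factory across worker threads, note that the StAX spec itself does not guarantee factory thread-safety; Woodstox documents its factories as safe once configured, but with other implementations one factory per thread is the safer default.)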
Running multiple instances also makes sense, although usually with at most one thread per core. However, you will only get a benefit as long as your storage I/O can support such speeds; if the disk is the bottleneck this will not help, and can in some cases hurt (if disk seeks compete). But it is worth a try.