Is there a Push-based/Non-blocking XML Parser for Java?

I'm looking for an XML parser that, rather than reading from an InputStream or InputSource, allows blocks of text to be pushed into it. For example, I would like to have something like the following:

public class DataReceiver {
    private SAXParser parser = //...
    private DefaultHandler handler = //...

    /**
     * Called each time some data is received.
     */
    public void onDataReceived(byte[] data) {
        parser.push(data, handler);
    }
}

The reason is that I would like something that plays nicely with the NIO networking libraries, rather than having to revert to the thread-per-connection model that a blocking InputStream requires.

Michael Barker asked Jun 21 '09 12:06



3 Answers

Surprisingly, no one has mentioned the one Java XML parser that does implement non-blocking ("async") parsing: Aalto. Part of the reason may be its lack of documentation and its low level of activity. Aalto implements the basic StAX API plus minor extensions that allow input to be pushed in; the functionality exists, but that part of the API has not been finalized. For more information, you could check out the related discussion group.

StaxMan answered Sep 26 '22 01:09


Edit: Now I see. You receive the XML in chunks and you want to feed it into a proper XML parser. So you need an object that is a queue at one end and an InputStream at the other?

You could aggregate the received byte arrays in a ByteArrayOutputStream, wrap the result in a ByteArrayInputStream, and feed that to the SAXParser.

Or you could check out the PipedInputStream/PipedOutputStream pair. In this case you'll need to do the parsing in another thread, because the SAX parser emits its events on the thread that calls parse(), which would otherwise block your receive().
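A minimal sketch of the piped approach, using only the JDK. The class name PipedSaxDemo and the method parseChunks are made up for illustration, and the handler just records element names. Note that write() on the pipe can still block once the pipe's internal buffer fills up, so this moves the blocking rather than removing it.

```java
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: the receiving thread writes chunks into a pipe,
// and a second thread runs the (blocking) SAX parser on the other end.
class PipedSaxDemo {

    /** Pushes the chunks through a pipe and returns the element names the parser saw. */
    static List<String> parseChunks(byte[][] chunks) throws Exception {
        List<String> names = new ArrayList<>();
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);

        // The parser blocks on in.read(), so it needs its own thread.
        Thread parserThread = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        names.add(qName);
                    }
                });
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        parserThread.start();

        // Stand-in for onDataReceived(): each chunk is written as it arrives.
        for (byte[] chunk : chunks) {
            out.write(chunk);
        }
        out.close();          // signals end of input to the parser
        parserThread.join();  // after join(), reading 'names' here is safe
        return names;
    }
}
```

This still costs one parser thread per document being parsed, which is exactly the model the question is trying to avoid, so it is more of a workaround than a real push parser.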

Edit: Based on the comments, I suggest taking the aggregation route. Collect the chunks in a ByteArrayOutputStream. To know whether you have received all the chunks of your XML, check whether the current chunk or the contents of the ByteArrayOutputStream contains the end tag of the XML root node. Then you can pass the data to a SAXParser, which can now run in the current thread without blocking on input. To avoid unnecessary array re-creation, you could implement your own simple unsynchronized byte array wrapper, or look for an existing implementation.
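The aggregation route could look roughly like the sketch below. BufferingReceiver and its members are hypothetical names, the root element name is assumed to be known in advance, and the encoding is assumed to be ASCII-compatible (e.g. UTF-8). The end-tag check is deliberately naive: it can fire early if the same closing tag appears in a comment, in CDATA, or on a nested element with the root's name, and it rescans the whole buffer on every chunk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical receiver: buffers chunks until the root end tag shows up,
// then parses the whole document in the calling thread.
class BufferingReceiver {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final String endTag;                      // e.g. "</root>"
    final List<String> elements = new ArrayList<>();  // element names seen so far

    BufferingReceiver(String rootName) {
        this.endTag = "</" + rootName + ">";
    }

    /** Called for each received chunk; returns true once a document was parsed. */
    boolean onDataReceived(byte[] data) throws Exception {
        buffer.write(data, 0, data.length);

        // ISO-8859-1 maps bytes to chars one-to-one, so an ASCII end tag
        // can be found without decoding the real document encoding.
        String text = new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
        if (!text.contains(endTag)) {
            return false;  // document not complete yet, keep buffering
        }

        // All input is in memory, so parse() cannot block waiting for data.
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(buffer.toByteArray()),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        elements.add(qName);
                    }
                });
        buffer.reset();
        return true;
    }
}
```

The trade-off is memory and latency: nothing is parsed until the whole document has arrived, so this only suits documents small enough to buffer.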

akarnokd answered Sep 22 '22 01:09


This is an April 2009 post from the Xerces J-Users mailing list, where the original poster has exactly the same issue. One potentially very good response, by "Jeff", is given, but there is no follow-up to the original poster's reply:

http://www.nabble.com/parsing-an-xml-document-chunk-by-chunk-td22945319.html

It's potentially recent enough to bump on the list, or at the very least it should help with the search.

Edit

Found another useful link that mentions a library called Woodstox and describes the state of stream-based vs. NIO-based parsers, along with some possible approaches to emulating a stream:

http://markmail.org/message/ogqqcj7dt3lwkbov

Harlan Iverson answered Sep 25 '22 01:09