Play Framework 2.0 BodyParser - push parsing XML streams

I feel rather out of my depth asking this question, since despite reading the official docs and the resources linked in these questions:

How to understand `Iteratee` in play2?

Can't understand Iteratee, Enumerator, Enumeratee in Play 2.0

... I'm still pretty hazy about iteratees, enumerators, and Play 2.0's reactive model in general. But anyway, I'd like to set up a web service that allows me to upload large XML (>100MB) files, pick out certain specific (non-interleaved) NodeSeqs, process them, and stream the results back to the client.

I figure the first thing I need to do is write a BodyParser that takes chunks of bytes, feeds them to an XML parser, and emits a stream of the NodeSeqs I want, say <doc>...</doc>, in a lazy manner.
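
Based on the scaladoc, this is my (quite possibly wrong) understanding of the rough shape a custom BodyParser takes - a minimal sketch that just concatenates the incoming chunks instead of parsing them lazily, so it isn't yet what I'm after (the rawBytes parser and the Uploads controller are names I've made up for illustration):

import play.api.libs.iteratee._
import play.api.mvc._

object Uploads extends Controller {

  // A BodyParser is essentially RequestHeader => Iteratee[Array[Byte], Either[Result, A]].
  // This sketch folds every incoming chunk into one big Array[Byte], i.e. it
  // buffers the whole body instead of parsing it lazily.
  def rawBytes: BodyParser[Array[Byte]] = BodyParser { requestHeader =>
    Iteratee
      .fold[Array[Byte], Array[Byte]](Array.empty[Byte]) { (buffer, chunk) =>
        buffer ++ chunk
      }
      .map(bytes => (Right(bytes): Either[Result, Array[Byte]]))
  }

  def upload = Action(rawBytes) { request =>
    Ok("received " + request.body.length + " bytes")
  }
}

Presumably the fold step is where an incremental XML parser would have to go, rather than the naive concatenation above.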

Could anyone offer any guidance and/or examples illustrating how this might be accomplished?

Update: More background:

My XML is actually a Solr add document, so it looks like:

<add>
    <doc>
        <field name="name">Some Entity</field>
        <field name="details">Blah blah...</field>
        ...
    </doc>
    ...
</add>

I want to process each <doc> in a streaming manner, so my parser would obviously have to wait until it hit a <doc> start event, buffer everything up to the matching </doc> end event, emit a NodeSeq of the completed element, and then flush its buffer.
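
To make that concrete (ignoring Play entirely for the moment), here is a rough sketch of the kind of buffering I mean, using plain StAX - the eachDoc helper and its callback are just made up for illustration:

import java.io.{ InputStream, StringWriter }
import javax.xml.stream.XMLInputFactory
import scala.xml.{ NodeSeq, XML }

// Wait for a <doc> start event, buffer every event up to the matching
// </doc>, then re-parse the buffered text into a NodeSeq, hand it to the
// callback and flush the buffer.
def eachDoc(in: InputStream)(handle: NodeSeq => Unit): Unit = {
  val events = XMLInputFactory.newInstance().createXMLEventReader(in)
  var writer: StringWriter = null // non-null while inside a <doc> element
  var depth = 0                   // guards against (unlikely) nested <doc> elements

  while (events.hasNext) {
    val event = events.nextEvent()
    if (event.isStartElement && event.asStartElement.getName.getLocalPart == "doc") {
      if (depth == 0) writer = new StringWriter()
      depth += 1
    }
    if (writer != null) event.writeAsEncodedUnicode(writer)
    if (event.isEndElement && event.asEndElement.getName.getLocalPart == "doc") {
      depth -= 1
      if (depth == 0) {
        handle(XML.loadString(writer.toString)) // one complete <doc>...</doc>
        writer = null                           // flush the buffer
      }
    }
  }
  events.close()
}

Re-parsing each buffered <doc> with XML.loadString is a bit wasteful, but since each element is small it may be acceptable.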

How this would work with a Play BodyParser, I am not entirely sure. More updates if I can further clarify what I want to do!

Although the whole XML file is large, each <doc /> element by itself is quite small, though I would obviously have to check that the byte buffer didn't exceed a certain size.

asked Jul 14 '12 by Mikesname

1 Answer

Scanning the docs, it seems Play simply collects the whole body and supplies an entire org.w3c.dom.Document for Java and a scala.xml.NodeSeq for Scala: play xml requests
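
From the Scala side that built-in parser is used roughly like this (a minimal sketch based on those docs; the controller and action names are made up, and the key point is that request.body is the entire parsed document in memory):

import play.api.mvc._

object SolrUpload extends Controller {

  // parse.xml buffers and parses the whole body, so request.body is a
  // scala.xml.NodeSeq holding the complete <add> document in memory.
  def addDocs = Action(parse.xml) { request =>
    val docs = request.body \\ "doc"
    Ok("received " + docs.size + " docs")
  }
}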

That seems highly unlikely to help in your case, as you'll end up with a big in-memory model. For 100MB of XML you can expect anything up to 700MB of memory usage to parse it.

Unfortunately, none of the currently available (and known) XML libraries support feeding in chunks as per the Iteratee model. Scales Xml provides a way to process chunks from a stream (turning a pull parser into an Enumerator) - see here for examples.

As such, for now I'd recommend taking a normal InputStream (or Reader) and feeding it into something similar to Scales. Perhaps a Play expert can recommend how to retrieve such a stream (without fully processing it) from within the framework.

NB: The current final release is due out shortly, but the next major release (0.5) will attempt to leverage aalto-xml to allow this kind of partial stream processing (non-blocking) from both sides.

answered Oct 12 '22 by Chris