Non-blocking method for parsing (streaming) XML in Python

I have an XML document coming in over a socket that I need to parse and react to on the fly (i.e. parsing a partial tree). What I'd like is a non-blocking method of doing so, so that I can do other things while waiting for more data to come in (without threading).

Something like iterparse would be ideal if it finished iterating when the read buffer was empty, e.g.:

context = iterparse(imaginary_socket_file_wrapper)
while 1:
    for event, elem in context:
        process_elem(elem)
    # iteration of context finishes when socket has no more data
    do_other_stuff()
    time.sleep(0.1)

I guess SAX would also be an option, but iterparse just seems simpler for my needs. Any ideas?

Update:

Using threads is fine, but introduces a level of complexity that I was hoping to sidestep. I thought that non-blocking calls would be a good way to do so, but I'm finding that it increases the complexity of parsing the XML.

Peter Gibson asked Sep 22 '09 11:09


3 Answers

Diving into the iterparse source provided the solution for me. Here's a simple example of building an XML tree on the fly and processing elements after their close tags:

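import xml.etree.ElementTree as etree

# XMLTreeBuilder is the feed parser from the Python 2 ElementTree;
# _parser and _end are private attributes, so this relies on internals.
parser = etree.XMLTreeBuilder()

def end_tag_event(tag):
    # Let the tree builder close the element, then process the finished node.
    node = parser._end(tag)
    print(node)

# Replace the underlying expat end-tag handler with our own.
parser._parser.EndElementHandler = end_tag_event

def data_received(data):
    # Feed each chunk as it arrives; the parser keeps incomplete input.
    parser.feed(data)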
In my case I ended up feeding it data from Twisted, but it should work with a non-blocking socket as well.
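On Python 3.4 and later, the standard library's xml.etree.ElementTree.XMLPullParser offers the same feed-and-react behaviour without touching private attributes. Here is a minimal sketch, assuming only end-tag events are of interest; make_receiver and the sample chunks are hypothetical names used for illustration:

```python
import xml.etree.ElementTree as ET

def make_receiver():
    """Return a data_received(data) callback bound to its own pull parser."""
    parser = ET.XMLPullParser(events=("end",))

    def data_received(data):
        # feed() never blocks: it parses whatever has arrived so far.
        parser.feed(data)
        # read_events() yields only the events that are already complete.
        return [elem.tag for _event, elem in parser.read_events()]

    return data_received

# Hypothetical chunks, split mid-tag the way a socket might deliver them.
data_received = make_receiver()
closed_tags = []
for chunk in (b"<root><item>a</it", b"em><item>b</item>", b"</root>"):
    closed_tags.extend(data_received(chunk))

print(closed_tags)
```

Each call returns only the elements whose close tags have been seen, so the caller is free to do other work between chunks.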

Peter Gibson answered Oct 12 '22 22:10


I think there are two components to this, the non-blocking network I/O, and a stream-oriented XML parser.

For the former, you'd have to pick a non-blocking network framework, or roll your own solution. Twisted would certainly work, but I personally find inversion-of-control frameworks difficult to wrap my brain around. You would likely have to keep track of a lot of state in your callbacks to feed the parser. For this reason I tend to find Eventlet a bit easier to program with, and I think it would fit well in this situation.

Essentially it allows you to write your code as if you were using a blocking socket call (using an ordinary loop or a generator or whatever you like), except that you can spawn it into a separate coroutine (a "greenlet") that will automatically perform a cooperative yield when I/O operations would block, thus allowing other coroutines to run.

This makes using any stream-oriented parser trivial again, because the code is structured like an ordinary blocking call. It also means that many libraries that don't directly deal with sockets or other I/O (like the parser for instance) don't have to be specially modified to be non-blocking: if they block, Eventlet yields the coroutine.

Admittedly Eventlet is slightly magic, but I find it has a much easier learning curve than Twisted, and results in more straightforward code because you don't have to turn your logic "inside out" to fit the framework.
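The coroutine style described above can be sketched with the standard library alone. This example uses asyncio as a stand-in for Eventlet, so the yields are explicit `await`s rather than implicit; the queue stands in for the socket, and parse_chunks, other_work, and the None end-of-stream sentinel are hypothetical names chosen for illustration:

```python
import asyncio
import xml.etree.ElementTree as ET

async def parse_chunks(queue, seen):
    """Reads like an ordinary blocking loop, but yields while waiting."""
    parser = ET.XMLPullParser(events=("end",))
    while True:
        data = await queue.get()   # other coroutines run while this waits
        if data is None:           # hypothetical end-of-stream sentinel
            break
        parser.feed(data)
        seen.extend(elem.tag for _event, elem in parser.read_events())

async def other_work(queue, ticks):
    """Stands in for the network delivering data plus unrelated work."""
    for chunk in (b"<root><item>a</item>", b"<item>b</item></root>"):
        await queue.put(chunk)
        ticks.append("did other stuff")
        await asyncio.sleep(0)     # cooperative yield back to the parser
    await queue.put(None)

async def main():
    queue = asyncio.Queue()
    seen, ticks = [], []
    await asyncio.gather(parse_chunks(queue, seen), other_work(queue, ticks))
    return seen, ticks

seen, ticks = asyncio.run(main())
print(seen, ticks)
```

The parsing loop keeps its straightforward top-to-bottom shape; the scheduler, not the parser, handles the waiting.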

edarc answered Oct 12 '22 23:10


If you won't use threads, you can use an event loop and poll non-blocking sockets.

asyncore is the standard library module for this kind of thing. Twisted is the async library for Python, but it is complex and probably a bit heavyweight for your needs.
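As a rough illustration of polling a non-blocking socket from a hand-rolled loop, here is a sketch that uses the standard library's selectors module rather than asyncore, with XMLPullParser (Python 3.4+) doing the incremental parsing. The socket pair stands in for the real connection, and poll_once is a hypothetical helper:

```python
import selectors
import socket
import xml.etree.ElementTree as ET

def poll_once(sel, parser, timeout=0.1):
    """Wait up to `timeout` seconds, feed any readable data to the parser,
    and return the tags of the elements that have closed."""
    for key, _mask in sel.select(timeout):
        data = key.fileobj.recv(4096)
        if data:
            parser.feed(data)
    return [elem.tag for _event, elem in parser.read_events()]

# Local socket pair standing in for the real network connection.
writer, reader = socket.socketpair()
reader.setblocking(False)
sel = selectors.DefaultSelector()
sel.register(reader, selectors.EVENT_READ)
parser = ET.XMLPullParser(events=("end",))

seen = []
for chunk in (b"<root><item>a</item>", b"<item>b</item></root>"):
    writer.sendall(chunk)
    seen.extend(poll_once(sel, parser))
    # do_other_stuff() would run here between polls

sel.close()
writer.close()
reader.close()
print(seen)
```

The timeout passed to select() takes the place of the time.sleep(0.1) in the question: the loop wakes as soon as data arrives, or after the timeout if none does.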

Alternatively, multiprocessing is the process-based alternative to threads, but it was only added in Python 2.6, and I assume you aren't running that.

One way or the other, I think you're going to have to use threads, extra processes or weave some equally complex async magic.

wbg answered Oct 12 '22 23:10