Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing large pseudo-xml files in python

Tags:

python

xml

I'm trying to parse* a large file (> 5GB) of structured markup data. The data format is essentially XML but there is no explicit root element. What's the most efficient way to do that?

The problem with SAX parsers is that they require a root element, so either I've to add a pseudo element to the data stream (is there an equivalent to Java's SequenceInputStream in Python?) or I've to switch to a non-SAX conform event-based parser (is there a successor of sgmllib?)

The structure of the data is quite simple. Basically a listing of elements:

<Document>
  <docid>1</docid>
  <text>foo</text>
</Document>
<Document>
  <docid>2</docid>
  <text>bar</text>
</Document>

*actually to iterate

like image 342
Peter Prettenhofer Avatar asked Oct 02 '09 11:10

Peter Prettenhofer


1 Answers

http://docs.python.org/library/xml.sax.html

Note, that you can pass a 'stream' object to xml.sax.parse. This means you can probably pass any object that has file-like methods (like read) to the parse call... Make your own object, which will firstly put your virtual root start-tag, then the contents of file, then virtual root end-tag. I guess that you only need to implement read method... but this might depend on the sax parser you'll use.

Example that works for me:

import xml.sax
import xml.sax.handler

class PseudoStream(object):
    def read_iterator(self):
        yield '<foo>'
        yield '<bar>'
        for line in open('test.xml'):
            yield line
        yield '</bar>'
        yield '</foo>'

    def __init__(self):
        self.ri = self.read_iterator()

    def read(self, *foo):
        try:
            return self.ri.next()
        except StopIteration:
            return ''

class SAXHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print name, attrs

d = xml.sax.parse(PseudoStream(), SAXHandler())
like image 131
liori Avatar answered Nov 14 '22 22:11

liori