Parsing large pseudo-XML files in Python

Question

I'm trying to parse* a large file (> 5GB) of structured markup data. The data format is essentially XML but there is no explicit root element. What's the most efficient way to do that?

The problem with SAX parsers is that they require a root element, so I either have to add a pseudo element to the data stream (is there an equivalent to Java's SequenceInputStream in Python?) or switch to an event-based parser that is not SAX-conformant (is there a successor to sgmllib?).

The structure of the data is quite simple. Basically a listing of elements:

<Document>
  <docid>1</docid>
  <text>foo</text>
</Document>
<Document>
  <docid>2</docid>
  <text>bar</text>
</Document>

*Actually, I only need to iterate over the elements.

liori · Accepted Answer

http://docs.python.org/library/xml.sax.html

Note that you can pass a stream object to xml.sax.parse. This means you can probably pass any object that has file-like methods (such as read) to the parse call. Make your own object that first emits a virtual root start tag, then the contents of the file, then the virtual root end tag. I guess you only need to implement the read method, but this might depend on the SAX parser you use.

Example that works for me:

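import xml.sax
import xml.sax.handler

class PseudoStream(object):
    """File-like object that wraps test.xml in virtual root elements."""

    def read_iterator(self):
        # Virtual root start tags, then the file contents, then the end tags.
        yield '<foo>'
        yield '<bar>'
        with open('test.xml') as f:
            for line in f:
                yield line
        yield '</bar>'
        yield '</foo>'

    def __init__(self):
        self.ri = self.read_iterator()

    def read(self, size=-1):
        # The size hint is ignored, except that read(0) must not consume
        # data: Python 3's xml.sax calls it to detect str versus bytes.
        if size == 0:
            return ''
        return next(self.ri, '')

    def close(self):
        # Python 3's xml.sax closes the stream once parsing has finished.
        self.ri.close()

class SAXHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print(name, attrs)

xml.sax.parse(PseudoStream(), SAXHandler())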
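
A variant of the same idea, offered as a minimal sketch rather than as part of the tested example above: the parser returned by xml.sax.make_parser() is normally expat-based and supports incremental feed() calls, so the virtual root tags can be fed around the file contents without writing a file-like wrapper. The names DocumentCollector, parse_rootless and virtual_root, as well as the chunk size, are made up for illustration. The sketch assumes the data has no XML declaration and that each Document contains only simple child elements such as docid and text, as in the question.

```python
import io
import xml.sax
import xml.sax.handler


class DocumentCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler: calls `callback` with a dict per <Document>."""

    def __init__(self, callback):
        super().__init__()
        self.callback = callback
        self.current = None   # dict for the <Document> being read
        self.field = None     # name of the child element being read
        self.chunks = []      # text pieces of that child element

    def startElement(self, name, attrs):
        if name == 'Document':
            self.current = {}
        elif self.current is not None:
            self.field = name
            self.chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self.field is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name == 'Document':
            self.callback(self.current)
            self.current = None
        elif name == self.field:
            self.current[name] = ''.join(self.chunks)
            self.field = None


def parse_rootless(stream, handler, chunk_size=64 * 1024):
    """Feed a text-mode `stream` to a SAX parser inside a virtual root."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.feed('<virtual_root>')
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.feed('</virtual_root>')
    parser.close()


# Small in-memory stand-in for the real multi-gigabyte file.
sample = io.StringIO(
    '<Document>\n  <docid>1</docid>\n  <text>foo</text>\n</Document>\n'
    '<Document>\n  <docid>2</docid>\n  <text>bar</text>\n</Document>\n'
)
docs = []
parse_rootless(sample, DocumentCollector(docs.append))
print(docs)
```

For the real file, the StringIO object would be replaced by an open text-mode file, and the callback would process each document as it arrives instead of appending it to a list, so that memory use stays bounded by the chunk size and one document at a time.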

Tags: python, xml

Peter Prettenhofer
