I'm trying to parse* a large file (> 5GB) of structured markup data. The data format is essentially XML but there is no explicit root element. What's the most efficient way to do that?
The problem with SAX parsers is that they require a root element, so either I've to add a pseudo element to the data stream (is there an equivalent to Java's SequenceInputStream in Python?) or I've to switch to a non-SAX conform event-based parser (is there a successor of sgmllib?)
The structure of the data is quite simple. Basically a listing of elements:
<Document>
<docid>1</docid>
<text>foo</text>
</Document>
<Document>
<docid>2</docid>
<text>bar</text>
</Document>
*actually to iterate
http://docs.python.org/library/xml.sax.html
Note, that you can pass a 'stream' object to xml.sax.parse
. This means you can probably pass any object that has file-like methods (like read
) to the parse
call... Make your own object, which will firstly put your virtual root start-tag, then the contents of file, then virtual root end-tag. I guess that you only need to implement read
method... but this might depend on the sax parser you'll use.
Example that works for me:
import xml.sax
import xml.sax.handler
class PseudoStream(object):
def read_iterator(self):
yield '<foo>'
yield '<bar>'
for line in open('test.xml'):
yield line
yield '</bar>'
yield '</foo>'
def __init__(self):
self.ri = self.read_iterator()
def read(self, *foo):
try:
return self.ri.next()
except StopIteration:
return ''
class SAXHandler(xml.sax.handler.ContentHandler):
def startElement(self, name, attrs):
print name, attrs
d = xml.sax.parse(PseudoStream(), SAXHandler())
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With