python libxml2 reader and XML_PARSE_RECOVER

Question

I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM API (libxml2.readDoc) works, and it recovers from entity problems.

However, using the same option with the reader API (which is essential given the size of the documents we are parsing) does not work. The loop below never terminates, because reader.Read() keeps returning -1:

Sample code with a small example:

import cStringIO
import libxml2

DOC = "<a>some broken & xml</a>"

reader = libxml2.readerForDoc(DOC, "urn:bogus", None, libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)

ret = reader.Read()
while ret:
    print 'ret: %d' % ret
    print "node name: ", reader.Name(), reader.NodeType()
    ret = reader.Read()

Any ideas how to recover correctly?

dcolish · Accepted Answer

I'm not sure about the current state of the libxml2 bindings; even the libxml2 site suggests using lxml instead. Parsing this document while ignoring the stray & is nice and clean in lxml:

from cStringIO import StringIO
from lxml import etree

DOC = "<a>some broken & xml</a>"

# recover=True tells the parser to keep going past well-formedness errors
parser = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(DOC), parser)
print etree.tostring(tree.getroot())
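
Since a recovering parser does not raise on errors, it can be useful to check what it had to work around. A minimal sketch, assuming the parser's error_log property is populated after a recovering parse; it avoids print so it runs on Python 2 and 3 alike, and the exact messages depend on the installed libxml2:

```python
from lxml import etree

DOC = "<a>some broken & xml</a>"

parser = etree.XMLParser(recover=True)
root = etree.fromstring(DOC, parser)

# Collect (line, message) pairs for whatever the parser reported while recovering.
messages = [(entry.line, entry.message) for entry in parser.error_log]
```
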

The parsers page in the lxml docs goes into more detail about setting up a parser and iterating over the contents.
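
For documents too large to hold in memory, which is what the reader API was wanted for, lxml's etree.iterparse also accepts recover=True. The following is only a sketch: the root/item document and the tag filter are made up for illustration, and how the broken text is repaired depends on the libxml2 version in use.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample document with a bare ampersand in the first item.
DOC = b"<root><item>one & two</item><item>three</item></root>"

texts = []
# "end" events fire once an element has been parsed completely.
for event, elem in etree.iterparse(BytesIO(DOC), events=("end",),
                                   tag="item", recover=True):
    texts.append(elem.text)
    # Drop the element's content once handled to keep memory use low.
    elem.clear()
```
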

Edit:

If you want to parse a document incrementally, the XMLParser class can be used as well, since it is a subclass of _FeedParser:

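from cStringIO import StringIO
from lxml import etree

DOC = "<a>some broken & xml</a>"
parser = etree.XMLParser(recover=True)

# Iterating over the string feeds one character at a time;
# real code would feed larger chunks.
for data in StringIO(DOC).read():
    parser.feed(data)

# close() finishes the parse and returns the root element
root = parser.close()
print etree.tostring(root)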
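
In practice the data would come from a file and be fed in larger blocks. A rough sketch of that, where parse_in_chunks and its chunk_size parameter are hypothetical names and not part of lxml; note that the feed interface still builds the complete tree in memory:

```python
from io import BytesIO
from lxml import etree

def parse_in_chunks(stream, chunk_size=64 * 1024):
    """Feed a binary file-like object to a recovering parser block by block."""
    parser = etree.XMLParser(recover=True)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    # close() finishes the parse and returns the root element.
    return parser.close()

# A tiny chunk size just to exercise the loop on the small example.
root = parse_in_chunks(BytesIO(b"<a>some broken & xml</a>"), chunk_size=8)
```
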
