Python + Expat: Error on &#0; entities

Question

I have written a small function that uses ElementTree and XPath to extract the text contents of certain elements in an XML file:

#!/usr/bin/env python2.5

import doctest
from xml.etree import ElementTree
from StringIO import StringIO

def parse_xml_etree(sin, xpath):
  """
Takes as input a stream containing XML and an XPath expression.
Applies the XPath expression to the XML and returns a generator
yielding the text contents of each element returned.

>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem1').next()
'one'
>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem2').next()
'two'
>>> parse_xml_etree(
...   StringIO('<test><null>&#0;</null><elem3>three</elem3></test>'),
...   '//elem3').next()
'three'
"""

  tree = ElementTree.parse(sin)
  for element in tree.findall(xpath):
    yield element.text  

if __name__ == '__main__':
  doctest.testmod(verbose=True)

The third test fails with the following exception:

ExpatError: reference to invalid character number: line 1, column 13

Is the &#0; entity illegal XML? Regardless of whether it is or not, the files I want to parse contain it, and I need some way to parse them. Any suggestions for a parser other than Expat, or for Expat settings, that would let me do that?

Update: I just discovered BeautifulSoup, a tag soup parser mentioned in a comment on the answer below. For fun I went back to this problem and tried to use it as an XML cleaner in front of ElementTree, but it dutifully converted the &#0; into a just-as-invalid null byte. :-)

from BeautifulSoup import BeautifulStoneSoup  # BeautifulSoup 3.x import path

cleaned_s = StringIO(
  BeautifulStoneSoup('<test><null>&#0;</null><elem3>three</elem3></test>',
                     convertEntities=BeautifulStoneSoup.XML_ENTITIES
  ).renderContents()
)
tree = ElementTree.parse(cleaned_s)

... yields

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12

In my particular case, though, I didn't really need XPath as such; I could have used BeautifulSoup itself and its quite simple node addressing style, e.g. parsed_tree.test.elem1.contents[0].

McDowell · Accepted Answer

&#0; is not in the legal character range defined by the XML spec. Alas, my Python skills are pretty rudimentary, so I'm not much help there.
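
For reference, the range in question is the Char production of XML 1.0, and a numeric character reference must also resolve to a character in that range. Here is a minimal sketch of the check in Python; the name is_xml10_char is made up for illustration and is not part of Expat or ElementTree.

```python
def is_xml10_char(code_point):
    """Return True if code_point matches the Char production of XML 1.0.

    Hypothetical helper for illustration; not an Expat or ElementTree API.
    """
    return (
        code_point in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD      # excludes surrogates, U+FFFE, U+FFFF
        or 0x10000 <= code_point <= 0x10FFFF
    )


nul_allowed = is_xml10_char(0)    # the code point that &#0; refers to
tab_allowed = is_xml10_char(0x9)
```

Since code point 0 falls outside every one of these ranges, a conforming parser such as Expat has to reject the reference rather than pass it through.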

Ned Batchelder · Answer

&#0; is not a valid XML character. Ideally, you'd be able to get the creator of the file to change their process so that the file was not invalid like this.

If you must accept these files, you could pre-process them to turn &#0; into something else. For example, pick @ as an escape character, turn "@" into "@@", and "&#0;" into "@0".

Then, as you get the text data from the parser, you can reverse the mapping. This is just an example; you can invent any escaping syntax you like.
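
Here is a rough sketch of that escaping scheme, written for Python 3 rather than the 2.5 used in the question. The helper names escape_nulls and unescape_nulls are invented for this example. It assumes the whole document fits in one string and that doubling "@" everywhere (including inside attribute values or comments) is acceptable.

```python
import re
from xml.etree import ElementTree

# Numeric references to code point 0, in decimal (&#0;, &#00;) or hex (&#x0;) form.
NULL_REF = re.compile(r'&#(?:0+|x0+);')


def escape_nulls(xml_text):
    """Double every '@' first, then replace each NUL reference with '@0'."""
    return NULL_REF.sub('@0', xml_text.replace('@', '@@'))


def unescape_nulls(text):
    """Reverse escape_nulls on text handed back by the parser."""
    # Scanning left to right, '@@' becomes '@' and '@0' becomes a NUL character.
    return re.sub(r'@([@0])',
                  lambda m: '\x00' if m.group(1) == '0' else '@',
                  text)


raw = '<test><null>&#0;</null><elem3>a@b</elem3></test>'
root = ElementTree.fromstring(escape_nulls(raw))   # parses without an error
null_text = unescape_nulls(root.find('null').text)
elem3_text = unescape_nulls(root.find('elem3').text)
```

Doubling "@" before inserting "@0" is what keeps the mapping reversible: a literal "@0" in the input becomes "@@0", which unescapes back to "@0" instead of a null character. The reverse step only needs to run on the text you actually read out of the tree.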

Tags: python, parsing, xml, elementtree, expat-parser

clacke
