Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python + Expat: Error on � entities

I have written a small function, which uses ElementTree and xpath to extract the text contents of certain elements in an xml file:

#!/usr/bin/env python2.5

import doctest
from xml.etree import ElementTree
from StringIO import StringIO

def parse_xml_etree(sin, xpath):
  """
Takes as input a stream containing XML and an XPath expression.
Applies the XPath expression to the XML and returns a generator
yielding the text contents of each element returned.

>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem1').next()
'one'
>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem2').next()
'two'
>>> parse_xml_etree(
...   StringIO('<test><null>&#0;</null><elem3>three</elem3></test>'),
...   '//elem2').next()
'three'
"""

  tree = ElementTree.parse(sin)
  for element in tree.findall(xpath):
    yield element.text  

if __name__ == '__main__':
  doctest.testmod(verbose=True)

The third test fails with the following exception:

ExpatError: reference to invalid character number: line 1, column 13

Is the &#0; entity illegal XML? Regardless whether it is or not, the files I want to parse contain it, and I need some way to parse them. Any suggestions for another parser than Expat, or settings for Expat, that would allow me to do that?


Update: I discovered BeautifulSoup just now, a tag soup parser as noted below in the answer comment, and for fun I went back to this problem and tried to use it as an XML-cleaner in front of ElementTree, but it dutifully converted the &#0; into a just-as-invalid null byte. :-)

cleaned_s = StringIO(
  BeautifulStoneSoup('<test><null>&#0;</null><elem3>three</elem3></test>',
                     convertEntities=BeautifulStoneSoup.XML_ENTITIES
  ).renderContents()
)
tree = ElementTree.parse(cleaned_s)

... yields

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12

In my particular case though, I didn't really need the XPath parsing as such, I could have gone with BeautifulSoup itself and its quite simple node adressing style parsed_tree.test.elem1.contents[0].

like image 876
clacke Avatar asked Jun 14 '10 16:06

clacke


2 Answers

&#0; is not in the legal character range defined by the XML spec. Alas, my Python skills are pretty rudimentary, so I'm not much help there.

like image 89
McDowell Avatar answered Sep 28 '22 19:09

McDowell


&#0; is not a valid XML character. Ideally, you'd be able to get the creator of the file to change their process so that the file was not invalid like this.

If you must accept these files, you could pre-process them to turn &#0 into something else. For example, pick @ as an escape character, turn "@" into "@@", and "&#0;" into "@0".

Then as you get the text data from the parser, you can reverse the mapping. This is just an example, you can invent any escaping syntax you like.

like image 38
Ned Batchelder Avatar answered Sep 28 '22 18:09

Ned Batchelder