Parse XML with (X)HTML entities

Trying to parse XML with ElementTree that contains an undefined entity (e.g. &nbsp;) raises:

ParseError: undefined entity &nbsp;

In Python 2.x, the XML entity dict can be updated by creating a parser (documentation):

parser = ET.XMLParser()
parser.entity["nbsp"] = unichr(160)

How can I do the same in Python 3.x?


Update: There was a misunderstanding on my side. I overlooked that I was calling parser.parser.UseForeignDTD(1) before trying to update the XML entity dict, which was causing an error with the parser. Luckily, @m.brindley was patient and pointed out that the XML entity dict still exists in Python 3.x and can be updated the same way as in Python 2.x.

asked Feb 07 '13 06:02 by theta


1 Answer

The issue here is that the only valid mnemonic entities in XML are quot, amp, apos, lt and gt. This means that almost all (X)HTML named entities must be defined in the DTD using the entity declaration markup defined in the XML 1.1 spec. If the document is to be standalone, this should be done with an inline DTD like so:

<?xml version="1.1" ?>
<!DOCTYPE naughtyxml [
    <!ENTITY nbsp "&#0160;">
    <!ENTITY copy "&#0169;">
]>
<data>
    <country name="Liechtenstein">
        <rank>1&nbsp;&gt;</rank>
        <year>2008&copy;</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
</data>
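
As a quick check that the inline DTD approach works with ElementTree, here is a minimal sketch that parses a trimmed-down version of the document above. It assumes a stock CPython 3 xml.etree.ElementTree; the XML declaration is left out and the sample data is shortened for brevity, so it is an illustration rather than the exact document.

```python
import xml.etree.ElementTree as ET

# Trimmed-down document with the entities declared in an inline DTD.
doc = '''<!DOCTYPE naughtyxml [
    <!ENTITY nbsp "&#0160;">
    <!ENTITY copy "&#0169;">
]>
<data>
    <rank>1&nbsp;&gt;</rank>
    <year>2008&copy;</year>
</data>'''

# No custom parser is needed: expat expands entities declared in the DTD.
root = ET.fromstring(doc)
rank_text = root.find('rank').text
year_text = root.find('year').text
print(repr(rank_text), repr(year_text))
```

Because the declarations travel with the document, nothing has to be configured on the parser side.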

The XMLParser in xml.etree.ElementTree uses xml.parsers.expat to do the actual parsing. The XMLParser constructor has an argument for 'predefined HTML entities', but that argument is not implemented yet. An empty dict named entity is created in the __init__ method, and this is what is used to look up undefined entities.

I don't think expat (and, by extension, the ET XMLParser) can switch to a namespace such as XHTML to get around this, possibly because it will not fetch external namespace definitions (I tried making xmlns="http://www.w3.org/1999/xhtml" the default namespace for the data element, but it did not play nicely), though I can't confirm that. By default, expat raises an error for non-XML entities, but you can get around that by defining an external DOCTYPE. This causes the expat parser to pass undefined entity references back to the ET.XMLParser's _default() method.

The _default() method looks up the entity name in the entity dict of the XMLParser instance and, if it finds a matching key, replaces the entity with the associated value. This preserves the Python 2.x syntax mentioned in the question.
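
To make the role of the external DOCTYPE concrete, here is a small sketch that feeds the same body to ElementTree twice, once without and once with a DOCTYPE. The system identifier "entities.dtd" and the helper parse_with_copy() are made up for illustration; the DTD is never fetched.

```python
import xml.etree.ElementTree as ET

BODY = '<data><year>2008&copy;</year></data>'
# "entities.dtd" is a made-up system identifier; it is never fetched here.
WITH_DOCTYPE = '<!DOCTYPE data SYSTEM "entities.dtd">' + BODY

def parse_with_copy(text):
    """Hypothetical helper: parse text with &copy; mapped in the entity dict."""
    parser = ET.XMLParser()  # a parser instance can only be used once
    parser.entity['copy'] = chr(169)
    return ET.fromstring(text, parser)

# Without a DOCTYPE, expat rejects the entity before the dict is consulted.
try:
    parse_with_copy(BODY)
    no_doctype_failed = False
except ET.ParseError as exc:
    no_doctype_failed = True
    print('without DOCTYPE:', exc)

# With an external DOCTYPE, the lookup in parser.entity takes effect.
year_text = parse_with_copy(WITH_DOCTYPE).find('year').text
print('with DOCTYPE:', repr(year_text))
```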

Solutions:

  • If the data does not have an external DOCTYPE and has (X)HTML mnemonic entities, you are out of luck. It is not valid XML and expat is right to throw an error. You should add an external DOCTYPE.
  • If the data has an external DOCTYPE, you can just use your old syntax to map mnemonic names to characters. Note: you should use chr() in Python 3, because unichr() no longer exists.
    • Alternatively, you could update XMLParser.entity from html.entities.html5 to map all valid HTML5 mnemonic entities to their characters. Keys in that dict generally carry the trailing semicolon (e.g. 'hellip;'), which has to be stripped first.
  • If the data is XHTML, you could subclass HTMLParser to handle mnemonic entities but this won't return an ElementTree as desired.
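
The html.entities.html5 option could look roughly like the sketch below. The helper name make_html5_entity_parser() and the system identifier "entities.dtd" are assumptions made for illustration. The trailing semicolons are stripped because the entity dict is looked up by bare name.

```python
import html.entities
import xml.etree.ElementTree as ET

def make_html5_entity_parser():
    """Hypothetical helper: an XMLParser preloaded with HTML5 entity names."""
    parser = ET.XMLParser()
    for name, chars in html.entities.html5.items():
        # html5 keys look like 'hellip;'; the entity dict is keyed on 'hellip'.
        parser.entity[name.rstrip(';')] = chars
    return parser

doc = ('<!DOCTYPE p SYSTEM "entities.dtd">'  # made-up system identifier
       '<p>wait&hellip;&nbsp;done</p>')
root = ET.fromstring(doc, make_html5_entity_parser())
print(repr(root.text))
```

A fresh parser is built for every document, since an XMLParser instance cannot be reused once parsing has finished.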

Here is the snippet I used. It parses XML with an external DOCTYPE in three ways: through HTMLParser (to demonstrate how to add entity handling by subclassing), through ET.XMLParser with entity mappings, and through expat (which will just silently ignore undefined entities due to the external DOCTYPE). There is a valid XML entity (&gt;) and an undefined entity (&copy;), which I map to chr(0x24B8) with the ET.XMLParser.

from html.parser import HTMLParser
from html.entities import name2codepoint
import xml.etree.ElementTree as ET
import xml.parsers.expat as expat

xml = '''<?xml version="1.0"?>
<!DOCTYPE data PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<data>
    <country name="Liechtenstein">
        <rank>1&gt;</rank>
        <year>2008&copy;</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
</data>'''

# HTMLParser subclass which handles entities
print('=== HTMLParser')
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, name, attrs):
        print('Start element:', name, attrs)
    def handle_endtag(self, name):
        print('End element:', name)
    def handle_data(self, data):
        print('Character data:', repr(data))
    def handle_entityref(self, name):
        self.handle_data(chr(name2codepoint[name]))

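# convert_charrefs=False (Python 3.4+) so that handle_entityref() is called;
# with the default of True, HTMLParser converts the references itself.
htmlparser = MyHTMLParser(convert_charrefs=False)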
htmlparser.feed(xml)


# ET.XMLParser parse
print('=== XMLParser')
parser = ET.XMLParser()
parser.entity['copy'] = chr(0x24B8)
root = ET.fromstring(xml, parser)
print(ET.tostring(root))
for elem in root:
    print(elem.tag, ' - ', elem.attrib)
    for subelem in elem:
        print(subelem.tag, ' - ', subelem.attrib, ' - ', subelem.text)

# Expat parse
def start_element(name, attrs):
    print('Start element:', name, attrs)
def end_element(name):
    print('End element:', name)
def char_data(data):
    print('Character data:', repr(data))
print('=== Expat')
expatparser = expat.ParserCreate()
expatparser.StartElementHandler = start_element
expatparser.EndElementHandler = end_element
expatparser.CharacterDataHandler = char_data
expatparser.Parse(xml)

answered Sep 19 '22 14:09 by m.brindley