
Python ElementTree support for parsing unknown XML entities?

I have a set of super simple XML files to parse, but they use custom-defined entities. I don't need to map these to characters, but I do want to parse and act on each one. For example:

    <Style name="admin-5678">
      <Rule>
        <Filter>[admin_level]='5'</Filter>
        &maxscale_zoom11;
      </Rule>
    </Style>

There is a tantalizing hint at http://effbot.org/elementtree/elementtree-xmlparser.htm that XMLParser has limited entity support, but I can't find the methods mentioned; everything I try gives errors:

    #!/usr/bin/python
    ##
    ## Where's the entity support as documented at:
    ## http://effbot.org/elementtree/elementtree-xmlparser.htm
    ## In Python 2.7.1+ ?
    ##
    from pprint     import pprint
    from xml.etree  import ElementTree
    from cStringIO  import StringIO

    parser = ElementTree.ElementTree()
    #parser.entity["maxscale_zoom11"] = unichr(160)
    testf = StringIO('<foo>&maxscale_zoom11;</foo>')
    tree = parser.parse(testf)
    #tree = parser.parse(testf,"XMLParser")
    for node in tree.iter('foo'):
        print node.text

Depending on how you adjust the comments, this gives:

xml.etree.ElementTree.ParseError: undefined entity: line 1, column 5 

or

AttributeError: 'ElementTree' object has no attribute 'entity' 

or

AttributeError: 'str' object has no attribute 'feed'            
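The last two errors appear to come from the calling convention rather than from entity handling: `ElementTree.ElementTree()` creates a tree wrapper, not a parser, so it has no `entity` attribute, and the second argument of `parse()` is expected to be a parser object rather than the string `"XMLParser"`. A minimal Python 3 sketch of the corrected call, which still reproduces the first error because nothing declares the entity:

```python
import io
import xml.etree.ElementTree as ET

data = b"<foo>&maxscale_zoom11;</foo>"

# parse() takes a parser instance as its second argument, not a class name string.
parser = ET.XMLParser()

try:
    ET.parse(io.BytesIO(data), parser=parser)
    outcome = "parsed"
except ET.ParseError as exc:
    # The entity is never declared, so the underlying expat parser rejects it.
    outcome = "ParseError: %s" % exc

print(outcome)
```

So fixing the call removes the two `AttributeError`s, but the undefined entity itself still has to be declared somewhere before the document will parse.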

For those curious, the XML is from OpenStreetMap's Mapnik project.

asked Aug 30 '11 00:08 by Bryce




1 Answer

As @cnelson already pointed out in a comment, the chosen solution here won't work in Python 3.

I finally got it working. Quoted from this Q&A.

Inspired by this post, we can just prepend a DOCTYPE declaration that defines the entities to the incoming raw HTML content, and then ElementTree works out of the box.

This works on Python 2.6, 2.7, 3.3, and 3.4.

    import xml.etree.ElementTree as ET

    html = '''<html>
        <div>Some reasonably well-formed HTML content.</div>
        <form action="login">
        <input name="foo" value="bar"/>
        <input name="username"/><input name="password"/>

        <div>It is not unusual to see &nbsp; in an HTML page.</div>

        </form></html>'''

    magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
                <!ENTITY nbsp ' '>
                ]>'''  # You can define more entities here, if needed

    et = ET.fromstring(magic + html)
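The same trick can probably be adapted to the custom entities in the question. The sketch below is one possible approach, not something from the original posts: it scans the text for entity references, declares each one in a prepended DOCTYPE, and maps it to a marker element so the code can "act on each one". The helper name `parse_with_entity_markers` and the `<entity name="..."/>` marker are hypothetical. It assumes the input has no XML declaration or DOCTYPE of its own, and that the entities appear only in element content, not inside attribute values.

```python
import re
import xml.etree.ElementTree as ET

# The five entities every XML parser already knows; these need no declaration.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def parse_with_entity_markers(xml_text, root_tag):
    """Hypothetical helper: declare every custom entity found in xml_text
    so that it expands to a marker element <entity name="..."/>."""
    names = set(re.findall(r"&([A-Za-z_][\w.-]*);", xml_text)) - PREDEFINED
    decls = "".join(
        "<!ENTITY %s \"<entity name='%s'/>\">" % (name, name)
        for name in sorted(names)
    )
    doctype = "<!DOCTYPE %s [%s]>" % (root_tag, decls)
    return ET.fromstring(doctype + xml_text)


sample = """<Style name="admin-5678">
  <Rule>
    <Filter>[admin_level]='5'</Filter>
    &maxscale_zoom11;
  </Rule>
</Style>"""

root = parse_with_entity_markers(sample, "Style")
for marker in root.iter("entity"):
    print(marker.get("name"))
```

Each reference then shows up in the tree as an ordinary element at the position where it occurred, so it can be found with `iter()` or `find()` like any other node. If the real definitions are available (for example from Mapnik's entity include files), they could be declared in the same DOCTYPE instead of the markers.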
answered Oct 04 '22 20:10 by RayLuo