I have a set of super simple XML files to parse... but... they use custom defined entities. I don't need to map these to characters, but I do wish to parse and act on each one. For example:
    <Style name="admin-5678">
      <Rule>
        <Filter>[admin_level]='5'</Filter>
        &maxscale_zoom11;
      </Rule>
    </Style>

There is a tantalizing hint at http://effbot.org/elementtree/elementtree-xmlparser.htm that XMLParser has limited entity support, but I can't find the methods mentioned. Everything I try gives errors:
    #!/usr/bin/python
    ##
    ## Where's the entity support as documented at:
    ## http://effbot.org/elementtree/elementtree-xmlparser.htm
    ## In Python 2.7.1+ ?
    ##
    from pprint import pprint
    from xml.etree import ElementTree
    from cStringIO import StringIO

    parser = ElementTree.ElementTree()
    #parser.entity["maxscale_zoom11"] = unichr(160)
    testf = StringIO('<foo>&maxscale_zoom11;</foo>')
    tree = parser.parse(testf)
    #tree = parser.parse(testf,"XMLParser")
    for node in tree.iter('foo'):
        print node.text

Which, depending on how you adjust the comments, gives:
xml.etree.ElementTree.ParseError: undefined entity: line 1, column 5
or
AttributeError: 'ElementTree' object has no attribute 'entity'
or
AttributeError: 'str' object has no attribute 'feed'
For those curious, the XML is from OpenStreetMap's Mapnik project.
ElementTree (xml.etree.ElementTree) is the Python standard-library module for parsing and navigating XML documents. It represents the document as a tree of elements that is easy to work with.
To read an XML file with ElementTree, first import the xml.etree.ElementTree module, conventionally under the name ET. Then pass the filename of the XML file to ET.parse() to parse it, and call getroot() on the result to get the root (top-level) element.
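As a minimal sketch of those steps, the snippet below parses a small document and walks it. To keep it self-contained it reads from an in-memory io.StringIO object rather than a file on disk (ET.parse() accepts either a filename or a file object); the sample XML is the question's snippet with the entity reference left out.

```python
import io
import xml.etree.ElementTree as ET

# Sample document: the question's XML without the custom entity,
# held in memory instead of a real file (an assumption for this sketch).
sample = io.StringIO('''<Style name="admin-5678">
  <Rule>
    <Filter>[admin_level]='5'</Filter>
  </Rule>
</Style>''')

tree = ET.parse(sample)   # also accepts a filename, e.g. ET.parse("style.xml")
root = tree.getroot()     # the top-level <Style> element

print(root.tag, root.attrib)

# iter() walks the whole tree and yields every matching element
filters = [node.text for node in root.iter('Filter')]
print(filters)
```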
As @cnelson already pointed out in a comment, the chosen solution here won't work in Python 3.
I finally got it working. Quoted from this Q&A.
Inspired by this post, we can just prepend a DOCTYPE with the needed entity definitions to the incoming raw HTML content, and then ElementTree works out of the box.
This works on Python 2.6, 2.7, 3.3, and 3.4.
    import xml.etree.ElementTree as ET

    html = '''<html>
        <div>Some reasonably well-formed HTML content.</div>
        <form action="login">
        <input name="foo" value="bar"/>
        <input name="username"/><input name="password"/>
        <div>It is not unusual to see &nbsp; in an HTML page.</div>
        </form></html>'''

    magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
                <!ENTITY nbsp ' '>
                ]>'''  # You can define more entities here, if needed

    et = ET.fromstring(magic + html)
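The same idea might be adapted to the custom entity from the question. The sketch below is only an illustration, not the original answer's code: it assumes the entity names are known in advance, and both the helper name build_doctype and the marker element entity-ref are made up for this example. Each entity is declared so that it expands to a marker element, which lets you find and act on every reference after parsing instead of mapping it to a character.

```python
import xml.etree.ElementTree as ET

xml_text = '''<Style name="admin-5678">
  <Rule>
    <Filter>[admin_level]='5'</Filter>
    &maxscale_zoom11;
  </Rule>
</Style>'''

# Assumption: the entity names used by the files are known up front.
entity_names = ['maxscale_zoom11']


def build_doctype(root_tag, names):
    """Hypothetical helper: build a DOCTYPE whose internal subset declares
    each entity as a marker element <entity-ref name='...'/>."""
    decls = ''.join(
        '<!ENTITY {0} "<entity-ref name=\'{0}\'/>">\n'.format(name)
        for name in names
    )
    return '<!DOCTYPE {0} [\n{1}]>\n'.format(root_tag, decls)


# Prepend the declarations, then parse as usual.
root = ET.fromstring(build_doctype('Style', entity_names) + xml_text)

# Every entity reference now shows up as an <entity-ref> element.
found = [el.get('name') for el in root.iter('entity-ref')]
print(found)
```

Declaring the entities in the internal subset keeps the parser setup untouched, so the approach does not depend on parser attributes that differ between Python 2 and Python 3. The trade-off is that any entity missing from the list still raises the "undefined entity" ParseError.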