Alternative XML parser for ElementTree to ease UTF-8 woes?

I am parsing some XML with the elementtree.parse() function. It works, except for some utf-8 characters (single-byte characters above 128). I see that the default parser is XMLTreeBuilder, which is based on expat.

Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?

This is the error I'm getting with the default parser:

ExpatError: not well-formed (invalid token): line 311, column 190

The character causing this is a single byte x92 (in hex). I'm not certain this is even a valid utf-8 character. But it would be nice to handle it because most text editors display this as: í

EDIT: The context of the character is: canít, where I assume it is supposed to be a fancy apostrophe, but in the hex editor, that same sequence is: 63 61 6E 92 74

I'll start from the question: "Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?"

All XML parsers will accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.

An XML document may start with a declaration like this:

`<?xml version="1.0" encoding="UTF-8"?>`

or like this: <?xml version="1.0"?> or not have a declaration at all ... in each case the parser will decode the document using UTF-8.

However your data is NOT encoded in UTF-8 ... it's probably Windows-1252 aka cp1252.
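To see why the byte is rejected, here is a small sketch that takes the five bytes from the question's hex dump and tries both decodings. It is written for Python 3 (an assumption on my part; the session further down is Python 2), and the variable names are only illustrative.

```python
# Bytes from the hex editor in the question: 63 61 6E 92 74
raw = bytes.fromhex("63 61 6E 92 74")

# 0x92 is a UTF-8 continuation byte; it cannot follow an ASCII
# character on its own, so strict UTF-8 decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# In Windows-1252 the same byte maps to U+2019.
as_cp1252 = raw.decode("cp1252")

print(utf8_ok)
print(ascii(as_cp1252))
```

So the byte is not a valid UTF-8 sequence by itself, but it is a perfectly ordinary cp1252 character.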

If the encoding is not UTF-8, then either the creator should include a declaration (or the recipient can prepend one) or the recipient can transcode the data to UTF-8. The following Python 2 session shows what works and what doesn't:

>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio

>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration

>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8

>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again

>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works

>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception

>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8

>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed
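For readers on Python 3, a rough equivalent of the session above might look like the sketch below. The assumptions: the input is handled as bytes through io.BytesIO instead of StringIO, the parser reports failures as xml.etree.ElementTree.ParseError rather than ExpatError, and try_parse is a hypothetical helper added only to keep the four cases short.

```python
import io
import xml.etree.ElementTree as ET

raw_text = b"<root>can\x92t</root>"  # cp1252 bytes, no XML declaration


def try_parse(data):
    """Return the root element's text, or the ParseError if parsing fails."""
    try:
        return ET.parse(io.BytesIO(data)).getroot().text
    except ET.ParseError as exc:
        return exc


# No declaration: the parser assumes UTF-8 and rejects the 0x92 byte.
no_decl = try_parse(raw_text)

# Declaring UTF-8 explicitly changes nothing; the byte is still invalid.
utf8_decl = try_parse(b'<?xml version="1.0" encoding="UTF-8"?>' + raw_text)

# Declaring cp1252 tells the parser how to decode the byte.
cp1252_decl = try_parse(b'<?xml version="1.0" encoding="cp1252"?>' + raw_text)

# Alternative: transcode to UTF-8 first; no declaration needed.
transcoded = try_parse(raw_text.decode("cp1252").encode("utf-8"))

for result in (no_decl, utf8_decl, cp1252_decl, transcoded):
    print(ascii(result))
```

The first two cases come back as ParseError objects and the last two as the decoded text, matching the Python 2 results.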

It looks like you have CP1252 text. If so, it should be specified at the top of the file, e.g.:

<?xml version="1.0" encoding="CP1252" ?>

This does work with ElementTree.

If you're creating these files yourself, don't write them in this encoding. Save them as UTF-8 and do your part to help kill obsolete text encodings.
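If you generate the files with ElementTree itself, a minimal Python 3 sketch of saving as UTF-8 could look like this. The in-memory buffer stands in for a real file opened in binary mode, and the element name and text are just the example from the question.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("root")
root.text = "can\u2019t"  # the right single quotation mark as a real character

buf = io.BytesIO()  # stand-in for open("some_file.xml", "wb")
ET.ElementTree(root).write(buf, encoding="utf-8", xml_declaration=True)
data = buf.getvalue()

# Reading the bytes back needs no special handling.
round_trip = ET.fromstring(data).text

print(data)
print(ascii(round_trip))
```

Passing xml_declaration=True makes the encoding explicit in the file, so a recipient does not have to guess.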

If you're receiving CP1252 data with no encoding specification, and you know for sure that it's always going to be CP1252, you can just convert it to UTF-8 before sending it to the parser:

s.decode("CP1252").encode("UTF-8")
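On Python 3 the same idea applies to a bytes object. The sketch below wraps it in a hypothetical helper, parse_cp1252_xml (my name, not part of ElementTree), and assumes the data carries no XML declaration naming some other encoding.

```python
import xml.etree.ElementTree as ET


def parse_cp1252_xml(data: bytes) -> ET.Element:
    """Transcode undeclared cp1252 bytes to UTF-8, then parse them."""
    utf8_data = data.decode("cp1252").encode("utf-8")
    return ET.fromstring(utf8_data)


root = parse_cp1252_xml(b"<root>can\x92t</root>")
print(ascii(root.text))
```

If the input already has a declaration such as encoding="CP1252", skip the transcoding and hand the bytes to the parser as they are; otherwise the declaration and the actual bytes would disagree.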

Tags: python, xml, utf-8, elementtree

Asked by Kekoa. 2 Answers: John Machin, Glenn Maynard.
