Alternative XML parser for ElementTree to ease UTF-8 woes?

I am parsing some XML with ElementTree's parse() function. It works, except for some UTF-8 characters (single-byte characters above 128). I see that the default parser is XMLTreeBuilder, which is based on expat.

Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?

This is the error I'm getting with the default parser:

ExpatError: not well-formed (invalid token): line 311, column 190

The character causing this is a single byte x92 (in hex). I'm not certain this is even a valid utf-8 character. But it would be nice to handle it because most text editors display this as: í

EDIT: The context of the character is: canít, where I assume it is supposed to be a fancy apostrophe, but in the hex editor, that same sequence is: 63 61 6E 92 74

Kekoa, asked Jul 16 '09 17:07


2 Answers

I'll start from the question: "Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?"

All XML parsers will accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.

An XML document may start with a declaration like this:

`<?xml version="1.0" encoding="UTF-8"?>`

or like this: `<?xml version="1.0"?>`, or have no declaration at all. In each case the parser will decode the document as UTF-8.

However your data is NOT encoded in UTF-8 ... it's probably Windows-1252 aka cp1252.
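A quick way to check this claim is to decode the byte sequence from the question (63 61 6E 92 74) under a few candidate encodings. The sketch below assumes Python 3; the `mac_roman` line is only a guess at why some editors show the byte as í, not something the question confirms.

```python
# Bytes copied from the hex dump in the question: 63 61 6E 92 74
raw = bytes.fromhex("63 61 6E 92 74")

# 0x92 is a UTF-8 continuation byte, so it cannot stand on its own.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# In Windows-1252, 0x92 maps to U+2019 RIGHT SINGLE QUOTATION MARK.
as_cp1252 = raw.decode("cp1252")

# Assumption: an editor defaulting to Mac Roman would render 0x92 as U+00ED.
as_mac_roman = raw.decode("mac_roman")

print(is_utf8, ascii(as_cp1252), ascii(as_mac_roman))
```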

If the encoding is not UTF-8, then either the creator should include a declaration (or the recipient can prepend one) or the recipient can transcode the data to UTF-8. The following Python 2 session shows what works and what doesn't:

>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio

>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration

>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8

>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again

>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works

>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception

>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8

>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed
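
A rough Python 3 counterpart of the session above, offered as a sketch rather than as part of the original answer. It assumes Python 3, where `StringIO` lives in the `io` module, byte strings need a `b` prefix, and ElementTree reports the failure as `ET.ParseError`; the variable names are illustrative.

```python
import io
import xml.etree.ElementTree as ET

raw_text = b"<root>can\x92t</root>"  # cp1252 bytes, no XML declaration

# 1. No declaration: the parser assumes UTF-8 and rejects byte 0x92.
try:
    ET.parse(io.BytesIO(raw_text))
    failed = False
except ET.ParseError:
    failed = True

# 2. Prepend a declaration naming the real encoding.
declared = b'<?xml version="1.0" encoding="cp1252"?>' + raw_text
text_declared = ET.parse(io.BytesIO(declared)).getroot().text

# 3. Transcode to UTF-8 instead; no declaration is needed then.
fixed_text = raw_text.decode("cp1252").encode("utf-8")
text_transcoded = ET.parse(io.BytesIO(fixed_text)).getroot().text

print(failed, ascii(text_declared), ascii(text_transcoded))
```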
John Machin, answered Oct 16 '22 02:10


It looks like you have CP1252 text. If so, it should be specified at the top of the file, e.g.:

<?xml version="1.0" encoding="CP1252" ?>

This does work with ElementTree.

If you're creating these files yourself, don't write them in this encoding. Save them as UTF-8 and do your part to help kill obsolete text encodings.

If you're receiving CP1252 data with no encoding specification, and you know for sure that it's always going to be CP1252, you can just convert it to UTF-8 before sending it to the parser:

s.decode("CP1252").encode("UTF-8")
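
In Python 3 the same one-liner only works when `s` is a `bytes` object, since `str` has no `decode` method. A minimal sketch of wrapping it in a helper, assuming Python 3 and data that carries no encoding declaration; `parse_undeclared_cp1252` is a hypothetical name, not an ElementTree function.

```python
import xml.etree.ElementTree as ET


def parse_undeclared_cp1252(data: bytes) -> ET.Element:
    """Hypothetical helper: transcode CP1252 bytes to UTF-8, then parse.

    Assumes the input has no XML encoding declaration and really is CP1252.
    """
    utf8_data = data.decode("cp1252").encode("utf-8")
    return ET.fromstring(utf8_data)


root = parse_undeclared_cp1252(b"<root>can\x92t</root>")
print(ascii(root.text))
```

Reading the file in binary mode (`open(path, "rb")`) keeps the data as `bytes` so it can be passed to the helper unchanged.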
Glenn Maynard, answered Oct 16 '22 00:10