Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse unicode strings with minidom?

I'm trying to parse a bunch of xml files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the XMLs go well, but for some of them I get the following error when calling minidom.parsestring():

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)

It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?

like image 887
dariopy Avatar asked Mar 16 '11 18:03

dariopy


People also ask

What is Minidom?

minidom is a minimal implementation of the Document Object Model interface, with an API similar to that in other languages. It is intended to be simpler than the full DOM and also significantly smaller. Users who are not already proficient with the DOM should consider using the xml.

What is Toprettyxml?

toprettyxml. n.toprettyxml(indent='\t',newl='\n') Returns a string, plain or Unicode, with the XML source for the subtree rooted at n, using indent to indent nested tags and newl to end lines. toxml. n.toxml( )

Is there a DOM in Python?

The DOM is a standard tree representation for XML data. The Document Object Model is being defined by the W3C in stages, or “levels” in their terminology. The Python mapping of the API is substantially based on the DOM Level 2 recommendation. DOM applications typically start by parsing some XML into a DOM.


1 Answers

Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.

If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.

like image 172
bobince Avatar answered Oct 19 '22 09:10

bobince