
ElementTree and unicode

I have this character in an XML file:

<data>
  <products>
    <color>fumè</color>
  </products>
</data>

I try to generate an instance of ElementTree with the following code:

string_data = open('file.xml')
x = ElementTree.fromstring(unicode(string_data.encode('utf-8')))

and I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 185: ordinal not in range(128) 

(NOTE: The position is not exact; I sampled the XML from a larger file.)

How can I solve it? Thanks.

pistacchio asked Sep 10 '12 10:09




2 Answers

If you ran into this problem while using Requests (HTTP for Humans): response.text returns the response body already decoded to unicode. Use response.content instead to get the undecoded bytes, so that ElementTree can decode them itself. Just remember to use the correct encoding.

More info: http://docs.python-requests.org/en/latest/user/quickstart/#response-content
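
As a rough sketch of the idea above: hand the raw bytes to ElementTree and let the XML parser apply the encoding named in the XML declaration. This is written for Python 3, where bytes plays the role of the Python 2 str. The helper name parse_xml_response and the FakeResponse class are hypothetical; FakeResponse only mimics the .content attribute of a Requests response so the example runs without network access.

```python
import xml.etree.ElementTree as ElementTree


def parse_xml_response(response):
    """Parse XML from a response-like object exposing raw bytes as .content."""
    # Passing bytes (not decoded text) lets the parser honour the
    # encoding given in the XML declaration.
    return ElementTree.fromstring(response.content)


class FakeResponse(object):
    """Hypothetical stand-in for a Requests response (assumption: only .content is needed)."""

    def __init__(self, content):
        self.content = content


# Bytes as they might arrive over the wire, encoded as ISO-8859-1.
body = (u'<?xml version="1.0" encoding="ISO-8859-1"?>'
        u'<color>fum\xe8</color>').encode('iso-8859-1')

root = parse_xml_response(FakeResponse(body))
color = root.text
```

With a real Requests response, the same helper would presumably be called as parse_xml_response(requests.get(url)), where url is whatever you are fetching.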

gitaarik answered Sep 22 '22 07:09


You need to decode UTF-8 byte strings into unicode objects, not encode them. So

string_data.encode('utf-8') 

should be

string_data.decode('utf-8') 

assuming string_data is actually a UTF-8 encoded string.

To summarize: to get a UTF-8 string from a unicode object, you encode the unicode object (using the UTF-8 encoding); to turn a string into a unicode object, you decode the string using its respective encoding.
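
The two directions can be sketched as follows. This is a minimal illustration written for Python 3, where the Python 2 unicode type is called str and the Python 2 byte string is called bytes; the sample XML is taken from the question with the closing tag corrected, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ElementTree

# Text (unicode) containing a non-ASCII character.
text = u'<data><products><color>fum\xe8</color></products></data>'

# encode: text -> UTF-8 bytes
raw = text.encode('utf-8')

# decode: UTF-8 bytes -> text
round_tripped = raw.decode('utf-8')

# fromstring() also accepts the raw bytes directly; without an XML
# declaration the parser assumes UTF-8.
root = ElementTree.fromstring(raw)
color = root.find('products/color').text
```

Calling encode on something that is already a byte string is what triggers the implicit ASCII decode in Python 2, which is consistent with the kind of 'ascii' codec error shown in the question.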

For more details on the concepts, I suggest reading Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" (not Python specific).

Lukas Graf answered Sep 24 '22 07:09