
Decode function tries to encode Python

I am trying to print a unicode string without the specific encoding hex in it. I'm grabbing this data from Facebook, which declares an encoding of UTF-8 in the HTML headers. When I print the type, it says it's unicode, but when I try to decode it with unicode-escape, it says there is an encoding error. Why is it trying to encode when I use the decode method?

Code

a='really long string of unicode html text that i wont reprint'
print type(a)
 >>> <type 'unicode'>   
print a.decode('unicode-escape')
 >>> Traceback (most recent call last):
  File "scfbp.py", line 203, in myFunctionPage
    print a.decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 1945: ordinal not in range(128)
JiminyCricket asked Jan 25 '11 23:01


2 Answers

It's not the decode that's failing. The error comes from trying to display the result on the console. When you use print, Python has to encode the string for output, and the encoding it falls back to here is ASCII, which cannot represent u'\u20ac'. Don't use print and it should work.

>>> a=u'really long string containing \\u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
u'really long string containing \u20ac and some other text'
>>> print a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)

I'd recommend using IDLE or some other interpreter that can output unicode. Then you won't get this problem.
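
To see that the failure sits in the encoding step print performs rather than in the decode, here is a minimal sketch that does that step by hand. It is written to run on Python 2 or 3. `encode_for_console` is a hypothetical helper name, and passing 'ascii' is an assumption that stands in for a console with no usable encoding.

```python
# Hypothetical helper: mimics the encoding step print performs before
# writing text to a console that uses the given encoding.
def encode_for_console(text, console_encoding):
    return text.encode(console_encoding)

text = u'really long string containing \u20ac and some other text'

try:
    encode_for_console(text, 'ascii')
    failed_at = None
except UnicodeEncodeError as exc:
    # Index of the first character ASCII cannot represent.
    failed_at = exc.start

# A console encoding that covers the euro sign has no such problem.
utf8_bytes = encode_for_console(text, 'utf-8')
```

The value of `failed_at` lines up with the "position 30" reported in the traceback above, and the UTF-8 call goes through without an exception.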


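Update: Note that this is not the same as the situation with one fewer backslash. There the string contains a real u'\u20ac' character, and the call fails during the decode itself, because in Python 2 calling decode on a unicode object first implicitly encodes it to ASCII. The error message is the same: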

>>> a=u'really long string containing \u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)
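
The two cases can be told apart by spelling out what Python 2 does when decode is called on a unicode object. This is a rough sketch under stated assumptions, not the interpreter's actual implementation: `py2_style_decode` is a hypothetical name, and the default encoding is assumed to be ASCII. It is written explicitly so it also runs on Python 3, where text strings have no decode method.

```python
import codecs

def py2_style_decode(text, codec):
    # Step 1: the implicit conversion applied to a unicode object before
    # decoding (default encoding assumed to be ASCII).
    raw = text.encode('ascii')
    # Step 2: the decode that was actually requested.
    return codecs.decode(raw, codec)

escaped = u'containing \\u20ac'   # backslash, 'u', '2', '0', 'a', 'c': all ASCII
literal = u'containing \u20ac'    # one real euro sign

# Escaped text survives step 1, and step 2 turns the escape into a euro sign.
decoded = py2_style_decode(escaped, 'unicode-escape')

# Text that already holds a euro sign never reaches step 2.
try:
    py2_style_decode(literal, 'unicode-escape')
    step_one_failed = False
except UnicodeEncodeError:
    step_one_failed = True
```

With the double backslash, the exception only appears later, when the decoded result is printed. With the single backslash, it is raised by the hidden encode in step 1.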
Mark Byers answered Sep 21 '22 18:09


When you print to the console, Python tries to encode (convert) the string to the character set of your terminal. If this is not UTF-8, or another encoding that can map all the characters in the string, it will whine and throw an exception.

This snags me every now and then when I do quick processing of data, with for example Turkish characters in it.
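
For quick processing where a lossy display is acceptable, one possible workaround (not something the answers above propose) is to replace whatever the terminal's character set cannot map before printing. A minimal sketch, assuming a hypothetical helper `printable_for` and using cp1254, the Windows Turkish code page, purely as an example:

```python
import sys

def printable_for(text, encoding):
    """Return text with characters the target encoding lacks replaced by '?'."""
    return text.encode(encoding, 'replace').decode(encoding)

# sys.stdout.encoding can be None (for example when output is piped),
# so fall back to ASCII in that case.
terminal_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'

sample = u'T\u00fcrk\u00e7e \u20ac'   # Turkish text plus a euro sign

safe_ascii = printable_for(sample, 'ascii')     # non-ASCII characters become '?'
safe_cp1254 = printable_for(sample, 'cp1254')   # cp1254 covers all of them

print(printable_for(sample, terminal_encoding))
```

The trade-off is that unmappable characters are silently lost, so this suits eyeballing data rather than anything that is written back out.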

If you are running python.exe through the Windows command prompt, you can find some solutions here: "What encoding/code page is cmd.exe using". Basically, you can change the code page with chcp, but it's quite cumbersome. I would follow Mark's advice and use something like IDLE.

Skurmedel answered Sep 19 '22 18:09