 

Able to run Python code with a Unicode string in Eclipse, but getting UnicodeEncodeError when running via command line or IDLE.

I've run into this a lot: I'll decode/encode some Unicode string in Eclipse (PyDev) and it runs fine, just how I expected, but when I launch the same script from the command line (for example) instead, I get encoding errors.

Is there any simple explanation for this? Is Eclipse doing something to the Unicode/manipulating it in some different way?

EDIT:

Example:

value = u'\u2019'.decode( 'utf-8', 'ignore' )
return value

This works in Eclipse (PyDev) but not if I run it in IDLE or on the command line.

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 135: ordinal not in range(128)

asked Dec 03 '22 by Matthew Ark

2 Answers

Just wanted to add why it worked in PyDev: it ships a special sitecustomize that customizes Python through sys.setdefaultencoding to use the encoding of the PyDev console.

Note that the answer from bobince is correct: if you have a unicode string, you have to use the encode() method to transform it into a byte string (you'd use decode if you had a byte string and wanted to transform it into unicode).
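To keep the two directions straight, here's a minimal sketch (the sample character is the U+2019 from the question; it runs under both Python 2 and 3 since both accept the u'' literal):

```python
# -*- coding: utf-8 -*-
text = u'\u2019'                    # a unicode string (RIGHT SINGLE QUOTATION MARK)

encoded = text.encode('utf-8')      # unicode -> byte string (the UTF-8 bytes for U+2019)
decoded = encoded.decode('utf-8')   # byte string -> unicode

print(decoded == text)              # True: the round trip is lossless
```

So encode always goes from unicode to bytes, and decode always goes the other way; calling the wrong one on the wrong type is what triggers the implicit conversion described below.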

answered Dec 05 '22 by Fabio Zadrozny


value = u'\u2019'.decode( 'utf-8', 'ignore' )

Byte strings are DECODED into Unicode strings.

Unicode strings are ENCODED into byte strings.

So if you say someunicodestring.decode, it tries to coerce the Unicode string to a byte string, in order to be able to decode it (back to Unicode!). Being an implicit conversion, this encoding step will plump for the default encoding, which may differ between different environments, and is likely to be the ‘safe’ value ascii, which will certainly produce the error you mention as ASCII can't contain the character U+2019. It's almost never a good idea to rely on the default encoding.
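That implicit step can be reproduced by hand. This is a sketch of what Python 2 effectively does first when you call .decode() on a unicode string: encode it with the default codec (usually ascii), which is exactly where the traceback in the question comes from:

```python
# Reproducing the implicit first step of u'\u2019'.decode('utf-8')
# under Python 2: encoding the unicode string with the ascii codec.
try:
    u'\u2019'.encode('ascii')
except UnicodeEncodeError as exc:
    # U+2019 has ordinal 8217, well outside ASCII's 0-127 range,
    # so this raises the same error the asker sees.
    print('implicit step fails: %s' % exc)
```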

So it doesn't make sense to try to decode a Unicode string. I'm pretty sure you mean:

value = u'\u2019'.encode('utf-8')

(ignore is redundant for encoding to UTF-8 as there is no character that this encoding can't represent.)
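A quick check of that claim (a small sketch; the sample string is arbitrary): passing 'ignore' changes nothing when the target is UTF-8, because there is no character for the handler to drop:

```python
# -*- coding: utf-8 -*-
sample = u'caf\xe9 \u2019 \u4e2d'   # Latin-1, punctuation, and CJK characters

# 'ignore' drops unencodable characters, but UTF-8 can represent
# every character, so both calls produce identical bytes.
print(sample.encode('utf-8', 'ignore') == sample.encode('utf-8'))  # True
```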

answered Dec 05 '22 by bobince