
Why is sys.getdefaultencoding() different from sys.stdout.encoding and how does this break Unicode strings?

I spent a few angry hours tracking down a problem with Unicode strings, and it came down to something that Python (2.7) hides from me and that I still don't understand. First, I tried to use u".." strings consistently in my code, but that resulted in the infamous UnicodeEncodeError. I tried using .encode('utf8'), but that didn't help either. Finally, it turned out I shouldn't use either and it all works out automagically.

However, while banging my head against the wall I did notice something weird (credit goes to a friend who helped me): sys.getdefaultencoding() returns ascii, while sys.stdout.encoding returns UTF-8. In the code below, case 1 works fine without any modifications to sys, and case 2 raises a UnicodeEncodeError. If I change the default system encoding with reload(sys).setdefaultencoding("utf8"), then case 2 works fine.

My question is why the two encoding variables are different in the first place, and how do I manage to use the wrong encoding in this simple piece of code? Please don't send me to the Unicode HOWTO; I've obviously read it while going through the tens of questions about UnicodeEncodeError.

#  -*- coding: utf-8 -*-
import sys


class Token:
    def __init__(self, string, final=False):
        self.value = string
        self.final = final

    def __str__(self):
        return self.value

    def __repr__(self):
        return self.value

print(sys.getdefaultencoding())
print(sys.stdout.encoding)

# 1.
myString = "I need 20 000€."
tok = Token(myString)
print(tok)

reload(sys).setdefaultencoding("utf8")

# 2.
myString = u"I need 20 000€."
tok = Token(myString)
print(tok)
Aleksandar Savkov asked Mar 20 '13 17:03



2 Answers

My question is why the two encoding variables are different in the first place

They serve different purposes.

sys.stdout.encoding should be the encoding that your terminal uses to interpret text; otherwise you may get mojibake in the output. It may be utf-8 in one environment, cp437 in another, etc.

sys.getdefaultencoding() is used on Python 2 for implicit conversions, i.e. when the encoding is not set explicitly. Python 2 may mix ascii-only bytestrings and Unicode strings together: for example, xml.etree.ElementTree stores text in the ascii range as bytestrings, and json.dumps() returns an ascii-only bytestring instead of Unicode. This is perhaps for performance reasons, since bytes were cheaper than Unicode for representing ascii characters. Implicit conversions are forbidden in Python 3.
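As a rough sketch of what such an implicit conversion amounts to, the snippet below spells out the encode and decode steps that Python 2 performs silently with sys.getdefaultencoding(). The helper names implicit_decode and implicit_encode are hypothetical and only mimic the behaviour; the calls are written explicitly so that the sketch also runs on Python 3, where no implicit conversion happens.

```python
# -*- coding: utf-8 -*-
DEFAULT_ENCODING = "ascii"  # assumed: what sys.getdefaultencoding() returns on Python 2


def implicit_decode(raw):
    """Hypothetical stand-in for Python 2 turning bytes into Unicode."""
    return raw.decode(DEFAULT_ENCODING)


def implicit_encode(text):
    """Hypothetical stand-in for Python 2 turning Unicode into bytes,
    e.g. when __str__ returns a Unicode string."""
    return text.encode(DEFAULT_ENCODING)


# Ascii-only data converts in both directions without anyone noticing.
mixed = implicit_decode(b"I need ") + u"20 000"
ascii_ok = implicit_encode(u"I need 20 000")

# Non-ascii data fails as soon as a conversion is attempted.
try:
    implicit_encode(u"I need 20 000\u20ac.")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

try:
    implicit_decode(u"\u20ac".encode("utf-8"))
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(type(encode_error).__name__, type(decode_error).__name__)
```

The second case in the question appears to follow the encode path: Token.__str__ returns a Unicode string, Python 2 encodes it with the default ascii codec to obtain a bytestring, and the euro sign cannot be encoded.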

sys.getdefaultencoding() is always 'ascii' on all systems in Python 2 unless you override it, which you should not do: overriding it may hide bugs, and your data may easily be corrupted by implicit conversions that use a possibly wrong encoding for the data.

btw, there is another common encoding, sys.getfilesystemencoding(), that may be different from the two. It should be the encoding that is used to encode OS data (filenames, command-line arguments, environment variables).
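A small sketch that prints the three encodings side by side. The values depend on the interpreter version, the platform, and the environment, so no particular output is assumed. The getattr guard is there because a replaced sys.stdout may lack an encoding attribute.

```python
import sys

encodings = {
    # used by Python 2 for implicit bytes <-> Unicode conversions
    "default": sys.getdefaultencoding(),
    # what Python believes the terminal (or pipe) expects; may be None
    "stdout": getattr(sys.stdout, "encoding", None),
    # used for filenames, command-line arguments, environment variables
    "filesystem": sys.getfilesystemencoding(),
}

for name in ("default", "stdout", "filesystem"):
    print("%-10s %s" % (name, encodings[name]))
```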

The source code encoding declared using # -*- coding: utf-8 -*- may be different from all of the already-mentioned encodings.

Naturally, if you read data from a file or the network, it may use a character encoding different from all of the above. For example, if a file created in Notepad is saved using a Windows ANSI encoding such as cp1252, then on another system all the standard encodings can be different from it.
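To illustrate, here is a minimal sketch that writes a cp1252-encoded file and reads it back. The temporary file and its contents are made up for the example; io.open is used because it accepts an encoding argument on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Simulate a file saved by an editor that uses cp1252 ("Windows ANSI"):
# the euro sign is the single byte 0x80 in that encoding.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(u"I need 20 000\u20ac.".encode("cp1252"))

try:
    # Reading with the encoding the file was actually written in works.
    with io.open(path, encoding="cp1252") as f:
        text = f.read()

    # Guessing utf-8 fails, because a lone 0x80 byte is not valid UTF-8.
    try:
        with io.open(path, encoding="utf-8") as f:
            f.read()
        wrong_guess_failed = False
    except UnicodeDecodeError:
        wrong_guess_failed = True
finally:
    os.remove(path)
```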

The point being: there could be multiple encodings for reasons unrelated to Python. To avoid the headache, use Unicode to represent text: convert encoded text to Unicode as soon as possible on input, and encode it to bytes (possibly using a different encoding) as late as possible on output. This is the so-called Unicode sandwich.
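The sandwich can be sketched in a few lines. The function name and the choice of cp1252 for input and utf-8 for output are assumptions made for the example, not something the pattern prescribes.

```python
# -*- coding: utf-8 -*-

def shout_line(raw, input_encoding="cp1252", output_encoding="utf-8"):
    """Hypothetical example: bytes in, bytes out, Unicode in between."""
    text = raw.decode(input_encoding)      # decode early, at the input boundary
    text = text.upper()                    # all processing happens on Unicode
    return text.encode(output_encoding)    # encode late, at the output boundary


# 0x80 is the euro sign in cp1252; in utf-8 it becomes the bytes E2 82 AC.
result = shout_line(b"I need 20 000\x80.")
print(repr(result))
```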

how do I manage to use the wrong encoding in this simple piece of code?

  1. Your first code example is not fine. You use non-ascii literal characters in a bytestring on Python 2, which you should not do. Use bytestring literals only for binary data (or for so-called native strings if necessary). The code may produce mojibake such as I need 20 000Γé¼. (notice the character noise) if you run it using Python 2 in any environment that does not use a utf-8-compatible encoding, such as the Windows console.

  2. The second code example is ok, assuming reload(sys) is not part of it. If you don't want to prefix all string literals with u'', you could use from __future__ import unicode_literals.
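The mojibake from the first item can be reproduced without a Windows console. The sketch below assumes the source file is saved as UTF-8 and that the console uses cp437, and it decodes the bytes by hand to show what such a console would display.

```python
# -*- coding: utf-8 -*-
# Bytes that a UTF-8 source file holds for the literal "I need 20 000€."
utf8_bytes = u"I need 20 000\u20ac.".encode("utf-8")

# A console that assumes cp437 interprets each byte as a separate character.
seen_in_cp437_console = utf8_bytes.decode("cp437")

# A console that assumes utf-8 recovers the original text.
seen_in_utf8_console = utf8_bytes.decode("utf-8")

# Compare against the garbled text quoted above, written with escapes.
print(seen_in_cp437_console == u"I need 20 000\u0393\u00e9\u00bc.")
```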

Your actual issue is the UnicodeEncodeError, and reload(sys) is not the right solution!
The correct solution is to configure your locale properly on POSIX (LANG, LC_CTYPE), to set the PYTHONIOENCODING envvar if the output is redirected to a pipe/file, or to install win-unicode-console to print Unicode to the Windows console.
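For the redirected-output case, the effect of PYTHONIOENCODING can be checked with a short sketch. It starts a child interpreter with its output captured through a pipe; the child script and the choice of utf-8 are assumptions for the example.

```python
import os
import subprocess
import sys

# Child script: print a euro sign; u'...' works on Python 2 and Python 3.3+.
child_code = "print(u'\\u20ac')"

# Copy the current environment and ask the child to use utf-8 for stdio.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

# stdout is a pipe here, i.e. the "redirected" case.
captured = subprocess.check_output([sys.executable, "-c", child_code], env=env)

# The pipe carries bytes; decode them at the boundary.
text = captured.decode("utf-8").strip()
```

From a POSIX shell, the equivalent is to set the variable for a single command, for example PYTHONIOENCODING=utf-8 python script.py > out.txt (script.py and out.txt are placeholder names).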

jfs answered Nov 20 '22 12:11



I have noticed the same behaviour in some standard code (the mailman library). Thanks for your analysis, it helped me save some time. :-) The problem is exactly the same: my system uses sys.getdefaultencoding() and gets ascii, which is inappropriate for handling a list of 1000 UTF-8 encoded names.

There is a mismatch between the stdin/stdout and even filesystem encoding (utf-8) on one hand and the "defaultencoding" (ascii) on the other. The thread "How to print UTF-8 encoded text to the console in Python < 3?" seems to indicate that this is well known, and "Changing default encoding of Python?" contains some indication that a more homogeneous setup (like "utf-8 everywhere") would break other things, like the hash implementation.

For that reason it is also not straightforward to change the defaultencoding. (See http://blog.ianbicking.org/illusive-setdefaultencoding.html for various ways to do so.) The setdefaultencoding function is removed from the sys module in the site.py file.
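A quick check of that removal, as a sketch: in an interpreter started the normal way (with site.py processed), the attribute is absent, which is why the question's code needs reload(sys) to get it back on Python 2. On Python 3 the function does not exist at all.

```python
import sys

# False in a normally started interpreter: site.py deletes the function
# on Python 2, and Python 3 never defines it.
setter_available = hasattr(sys, "setdefaultencoding")
print(setter_available)
```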

Dirk answered Nov 20 '22 11:11
