
How to print UTF-8 encoded text to the console in Python < 3?

I'm running a recent Linux system where all my locales are UTF-8:

    LANG=de_DE.UTF-8
    LANGUAGE=
    LC_CTYPE="de_DE.UTF-8"
    LC_NUMERIC="de_DE.UTF-8"
    LC_TIME="de_DE.UTF-8"
    ...
    LC_IDENTIFICATION="de_DE.UTF-8"
    LC_ALL=

Now I want to write UTF-8 encoded content to the console.

Right now Python uses UTF-8 for the FS encoding but sticks to ASCII for the default encoding :-(

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> sys.getfilesystemencoding()
    'UTF-8'

I thought the best (clean) way to do this was setting the PYTHONIOENCODING environment variable. But it seems that Python ignores it. At least on my system I keep getting ascii as default encoding, even after setting the envvar.

    # tried this in ~/.bashrc and ~/.profile (also sourced them)
    # and on the commandline before running python
    export PYTHONIOENCODING=UTF-8
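A note on the variable itself: as far as I know, PYTHONIOENCODING is documented to control the encoding of sys.stdin, sys.stdout and sys.stderr, not the value returned by sys.getdefaultencoding(). The sketch below is one way to check that: it starts a child interpreter with the variable set and reports both values. The helper name encodings_in_child and the test value latin-1 are my own choices for illustration, not something from the question.

```python
import codecs
import os
import subprocess
import sys

# Code run in the child interpreter: report the stream encoding and the
# default encoding, separated by a space.
CHILD = (
    "import sys; "
    "sys.stdout.write('%s %s' % (sys.stdout.encoding, sys.getdefaultencoding()))"
)


def encodings_in_child(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set (hypothetical helper).

    Returns a tuple (stdout encoding, default encoding) as canonical codec names.
    """
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    output = subprocess.check_output([sys.executable, "-c", CHILD], env=env)
    stream_enc, default_enc = output.decode("ascii").split()
    # Normalize spellings such as "UTF-8" or "latin-1" to the codec's canonical name.
    return codecs.lookup(stream_enc).name, codecs.lookup(default_enc).name


if __name__ == "__main__":
    print(encodings_in_child("latin-1"))
```

If the first value follows the variable while the second stays the same, then the variable is being honoured and the unchanged sys.getdefaultencoding() result is expected behaviour rather than a sign that Python ignores it.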

If I do the following at the start of a script, it works though:

    >>> import sys
    >>> reload(sys)  # to enable `setdefaultencoding` again
    <module 'sys' (built-in)>
    >>> sys.setdefaultencoding("UTF-8")
    >>> sys.getdefaultencoding()
    'UTF-8'

But that approach seems unclean. So, what's a good way to accomplish this?

Workaround

Instead of changing the default encoding - which is not a good idea (see mesilliac's answer) - I just wrap sys.stdout with a StreamWriter like this:

    import codecs, locale, sys
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)

See this gist for a small utility function that handles it.
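The gist is not reproduced here. As a rough sketch of what such a helper could look like (the name wrap_byte_stream is hypothetical, and the fallback to UTF-8 is my assumption), the wrapper below encodes unicode text on the way out. An io.BytesIO object stands in for Python 2's byte-oriented sys.stdout so that the result can be inspected.

```python
# -*- coding: utf-8 -*-
import codecs
import io
import locale


def wrap_byte_stream(stream, encoding=None):
    """Return a StreamWriter that encodes unicode text before writing to `stream`.

    Hypothetical helper: `stream` must accept byte strings, as sys.stdout
    does in Python 2.
    """
    if encoding is None:
        # Assumption: fall back to UTF-8 if the locale reports nothing usable.
        encoding = locale.getpreferredencoding() or "UTF-8"
    return codecs.getwriter(encoding)(stream)


# Stand-in for a byte-oriented stdout, so the written bytes can be checked.
raw = io.BytesIO()
out = wrap_byte_stream(raw, "UTF-8")
out.write(u"Gr\u00fc\u00dfe\n")  # unicode text with non-ASCII characters
```

In a Python 2 script the same helper would be applied as sys.stdout = wrap_byte_stream(sys.stdout). The default encoding stays untouched, so only text written through the wrapped stream is affected.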

asked Jul 31 '12 13:07 by Brutus




1 Answer

It seems that changing the default encoding is not recommended.

Fedora suggested using the system locale as the default, but apparently this breaks other things.

Here's a quote from the mailing-list discussion:

    The only supported default encodings in Python are:

      Python 2.x: ASCII
      Python 3.x: UTF-8

    If you change these, you are on your own and strange things will start to happen. The default encoding does not only affect the translation between Python and the outside world, but also all internal conversions between 8-bit strings and Unicode.

    Hacks like what's happening in the pango module (setting the default encoding to 'utf-8' by reloading the site module in order to get the sys.setdefaultencoding() API back) are just downright wrong and will cause serious problems since Unicode objects cache their default encoded representation.

    Please don't enable the use of a locale based default encoding.

    If all you want to achieve is getting the encodings of stdout and stdin correctly setup for pipes, you should instead change the .encoding attribute of those (only).

    --
    Marc-Andre Lemburg
    eGenix.com

answered Sep 24 '22 12:09 by mesilliac