
Understanding Python Unicode and Linux terminal

I have a Python script that writes some strings with UTF-8 encoding. In my script I mainly use the str() function to cast to string. It looks like this:

mystring="this is unicode string:"+japanesevalues[1] 
#japanesevalues is a list of unicode values, I am sure it is unicode
print mystring

I don't use the Python interactive interpreter, just the standard terminal on Red Hat Linux x86_64. I set the terminal to output UTF-8 characters.

If I execute this, the output is correct:

#python myscript.py
this is unicode string: カラダーズ ソフィー

But if I redirect the output to a file:

#python myscript.py > output

I get the typical error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 253-254: ordinal not in range(128)

Why is that?

Cesc asked Jul 02 '13 06:07


2 Answers

The terminal has a character set, and Python knows what that character set is, so it will automatically encode your Unicode strings to the byte encoding that the terminal uses, in your case UTF-8.

But when you redirect, you are no longer writing to the terminal. You are now writing to a plain file (or, with |, a Unix pipe). Neither has a charset, and Python has no way of knowing which encoding you now want, so it will fall back to a default character set. You have marked your question with "Python-3.x" but your print syntax is Python 2, so I suspect you are actually using Python 2. In that case your sys.getdefaultencoding() is generally 'ascii', and in your case it definitely is. And of course, you cannot encode Japanese characters as ASCII, so you get an error.
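The failure can be reproduced without any redirection by encoding the text explicitly. This is a minimal sketch; the sample string is an assumption standing in for japanesevalues[1], which the question does not show.

```python
# -*- coding: utf-8 -*-
# Sample text is an assumption standing in for japanesevalues[1].
sample = u"\u30ab\u30e9\u30c0\u30fc\u30ba"  # five katakana characters

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = sample.encode("utf-8")

# ASCII only covers code points 0-127, so this raises UnicodeEncodeError,
# the same error Python 2 hits when it falls back to 'ascii' for print.
try:
    sample.encode("ascii")
    ascii_failed = False
    error_message = ""
except UnicodeEncodeError as exc:
    ascii_failed = True
    error_message = str(exc)
```

Here utf8_bytes holds the encoded text, while the ASCII attempt ends up in the except branch with the familiar "ordinal not in range(128)" message.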

Your best bet when using Python 2 is to encode the string with UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. The output will be wrong on a terminal that uses a different encoding, though; to handle that, you can get the terminal encoding from sys.stdout.encoding and use it instead (it will be None when redirecting under Python 2).
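One way to combine both suggestions is to use the stream's encoding when it has one and fall back to UTF-8 otherwise. The sketch below is only an illustration: encode_for_stream and FakeRedirectedStdout are hypothetical names, and the fake stream merely imitates what Python 2 reports for a redirected stdout.

```python
# -*- coding: utf-8 -*-
import sys


def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode text with the stream's encoding, or with fallback if it has none."""
    # Hypothetical helper: under Python 2, sys.stdout.encoding is None
    # when output is redirected, so the "or" picks the fallback.
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding)


class FakeRedirectedStdout(object):
    """Hypothetical stand-in for a redirected Python 2 sys.stdout."""
    encoding = None


# Sample text is an assumption standing in for the question's string.
mystring = u"this is unicode string: \u30ab\u30e9\u30c0\u30fc\u30ba"
data = encode_for_stream(mystring, FakeRedirectedStdout())
```

In a real Python 2 script the call would be print encode_for_stream(mystring, sys.stdout), which prints a byte string and therefore skips the implicit ASCII encoding step.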

In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).
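In Python 3, print encodes text using the encoding of the stream it writes to. The sketch below imitates that with an in-memory stream so it does not depend on the local setup; the BytesIO object is an assumption standing in for the redirected file. Which encoding the real sys.stdout picks when redirected depends on the locale and the Python version, and the PYTHONIOENCODING environment variable can override it.

```python
import io

# Sample text is an assumption standing in for the question's string.
mystring = "this is unicode string: \u30ab\u30e9\u30c0\u30fc\u30ba"

# BytesIO stands in for the redirected file; the TextIOWrapper plays
# the role of sys.stdout with an explicit UTF-8 encoding.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

print(mystring, file=out)
out.flush()  # push the encoded bytes through to the underlying buffer

written = raw.getvalue()
```

After the flush, written contains the UTF-8 bytes of the string followed by a newline, which is what would end up in the output file.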

Lennart Regebro answered Sep 28 '22 18:09


If it outputs to the terminal, then Python can examine the locale settings (such as $LANG) to pick a charset. All bets are off if you redirect.
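To see what Python has picked in a given situation, a script can report it directly. A small sketch, with describe_stdout being a hypothetical helper name:

```python
import locale
import sys


def describe_stdout():
    """Collect what Python knows about stdout and the locale."""
    return {
        # True when stdout is attached to a terminal, False when redirected.
        "isatty": sys.stdout.isatty(),
        # Encoding chosen for stdout; None under Python 2 when redirected.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings such as $LANG.
        "locale_encoding": locale.getpreferredencoding(),
    }


info = describe_stdout()
```

Writing info to sys.stderr once with and once without > output shows the difference between the two cases while keeping the report visible on the terminal.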

Ignacio Vazquez-Abrams answered Sep 28 '22 19:09