I need to parse and output some data in a table-like format. The input is Unicode text. Here is the test script:
#!/usr/bin/env python
s1 = u'abcd'
s2 = u'\u03b1\u03b2\u03b3\u03b4'
print '1234567890'
print '%5s' % s1
print '%5s' % s2
It works as expected when run directly, like ./test.py:
1234567890
 abcd
 αβγδ
But if I try to redirect the output to a file, test.py > a.txt, I get an error:
Traceback (most recent call last):
File "./test.py", line 8, in <module>
print '%5s' % s2
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)
If I convert the strings to UTF-8, like s2.encode('utf8'), the redirection works fine, but the data positions are broken:
1234567890
 abcd
αβγδ
How to force it to work properly in both cases?
It boils down to your output stream's encoding. In this particular case, since you're using print, the output stream is sys.stdout.
stdout not redirected

When you run Python in interactive mode, or when you don't redirect stdout to a file, Python chooses the encoding based on the environment, namely the locale environment variables like LC_CTYPE. For example, if you run your program like this:
$ LC_CTYPE='en_US' python test.py
...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)
it will use ANSI_X3.4-1968 (ASCII) for sys.stdout (see sys.stdout.encoding) and fail. However, if you use UTF-8 (as you obviously already do):
$ LC_CTYPE='en_US.UTF-8' python test.py
1234567890
abcd
αβγδ
you'll get the expected output.
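The failure can also be reproduced without any redirection. Below is a minimal sketch of the implicit step print performs: it encodes the unicode string with the stream's encoding, simulated here with an explicit ASCII encode (the encoding is hard-coded only for illustration):

```python
# -*- coding: utf-8 -*-
# Simulate what print does under an ASCII locale: it implicitly encodes
# the unicode string with sys.stdout.encoding before writing it out.
s2 = u'\u03b1\u03b2\u03b3\u03b4'
try:
    s2.encode('ascii')  # the same failure print hits when the locale is ASCII
    print('no error')
except UnicodeEncodeError as e:
    print('UnicodeEncodeError: %s' % e)
```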
stdout redirected to a file

When you redirect stdout to a file, Python does not try to detect the encoding from your locale; instead it checks another environment variable, PYTHONIOENCODING (check the source, initstdio() in Python/pylifecycle.c). For example, this will work as expected:
$ PYTHONIOENCODING=utf-8 python test.py >/tmp/output
since Python will use UTF-8 encoding for the /tmp/output file.
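The same behaviour can be verified from a script. A sketch that launches a child interpreter with PYTHONIOENCODING set and its stdout redirected to a temporary file (the child one-liner is made up for illustration):

```python
import os
import subprocess
import sys
import tempfile

# Run a child interpreter with PYTHONIOENCODING=utf-8 and its stdout
# redirected to a file, then verify the file really holds UTF-8 bytes.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as out:
    subprocess.check_call(
        [sys.executable, '-c',
         "import sys; sys.stdout.write(u'\\u03b1\\u03b2\\u03b3\\u03b4')"],
        stdout=out, env=env)
with open(path, 'rb') as f:
    data = f.read()
os.remove(path)
print(data.decode('utf-8'))  # the four Greek letters survived the round trip
```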
stdout encoding override

You can also manually re-open sys.stdout with the desired encoding (see this and this SO question):
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
Now print will correctly output both str and unicode objects, since the underlying stream writer converts them to UTF-8 on the fly.
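What the wrapper does can be seen with an in-memory byte buffer standing in for the real sys.stdout (a sketch; io.BytesIO is only a stand-in here):

```python
import codecs
import io

# codecs.getwriter wraps a byte stream; every unicode string written to
# the wrapper is encoded to UTF-8 before it reaches the underlying stream.
buf = io.BytesIO()
writer = codecs.getwriter('utf8')(buf)
writer.write(u'\u03b1\u03b2\u03b3\u03b4')
print(buf.getvalue())  # 8 UTF-8 bytes for the 4 Greek characters
```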
Of course, you can also manually encode each unicode to UTF-8 str prior to output with:
print ('%5s' % s2).encode('utf8')
but that's tedious and error-prone.
For completeness: when opening files for writing with a specific encoding (like UTF-8) in Python 2, you should use either io.open or codecs.open because they allow you to specify the encoding (see this question), unlike the built-in open:
from codecs import open
myfile = open('filename', 'w', encoding='utf-8')
or:
from io import open
myfile = open('filename', 'w', encoding='utf-8')
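A short round trip with io.open showing the explicit 'w' mode for writing (the file name and temporary directory are made up for the example):

```python
import io
import os
import tempfile

# Write unicode through io.open with an explicit encoding, then read it back.
path = os.path.join(tempfile.mkdtemp(), 'greek.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\u03b1\u03b2\u03b3\u03b4\n')
with io.open(path, 'r', encoding='utf-8') as f:
    content = f.read()
print(content)
```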
You should encode '%5s' % s2, not s2. So the following will produce the expected output:
print ('%5s' % s2).encode('utf8')
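The order matters because % pads by character count, while the encoded str is measured in bytes; a quick sketch of the difference:

```python
# Four characters, but eight UTF-8 bytes: %5s applied to the encoded bytes
# sees a "length 8" value and adds no padding, while padding the unicode
# string first adds the expected leading space.
s2 = u'\u03b1\u03b2\u03b3\u03b4'
print(len(s2))                  # 4 characters
print(len(s2.encode('utf8')))   # 8 bytes
padded = u'%5s' % s2
print(len(padded))              # 5: one leading space plus 4 characters
```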