I'm trying to find a generic solution to print unicode strings from a python script.
The requirements are that it must run in both python 2.7 and 3.x, on any platform, and with any terminal settings and environment variables (e.g. LANG=C or LANG=en_US.UTF-8).
The python print function automatically tries to encode to the terminal encoding when printing, but if the terminal encoding is ascii it fails.
For example, the following works when the environment has "LANG=en_US.UTF-8":
x = u'\xea'
print(x)
But it fails in python 2.7 when "LANG=C":
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 0: ordinal not in range(128)
The following works regardless of the LANG setting, but would not properly show unicode characters if the terminal was using a different unicode encoding:
print(x.encode('utf-8'))
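A further wrinkle with this approach: on Python 3, print() receives a bytes object and shows its repr rather than the character itself:

```python
x = u'\xea'
encoded = x.encode('utf-8')
# On Python 3 this is a bytes object, so print() shows its repr,
# e.g. b'\xc3\xaa', instead of the intended character
print(encoded)
```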
The desired behavior would be to always show unicode in the terminal if it is possible and show some encoding if the terminal does not support unicode. For example, the output would be UTF-8 encoded if the terminal only supported ascii. Basically, the goal is to do the same thing as the python print function when it works, but in the cases where the print function fails, use some default encoding.
You can handle the LANG=C case by telling sys.stdout to default to UTF-8 in cases when it would otherwise default to ASCII.
import sys, codecs

if sys.stdout.encoding is None or sys.stdout.encoding == 'ANSI_X3.4-1968':
    utf8_writer = codecs.getwriter('UTF-8')
    if sys.version_info.major < 3:
        sys.stdout = utf8_writer(sys.stdout, errors='replace')
    else:
        sys.stdout = utf8_writer(sys.stdout.buffer, errors='replace')

print(u'\N{snowman}')
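To see what the wrapper actually does, the same writer can be applied to an in-memory byte stream (a sketch using io.BytesIO to stand in for sys.stdout.buffer):

```python
import codecs
import io

buf = io.BytesIO()  # stand-in for sys.stdout.buffer
writer = codecs.getwriter('UTF-8')(buf, errors='replace')
writer.write(u'\N{snowman}\n')

# The writer encoded the text to UTF-8 bytes before they reached the stream
assert buf.getvalue() == b'\xe2\x98\x83\n'
```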
The above snippet fulfills your requirements: it works in Python 2.7 and 3.4, and it doesn't break when LANG is in a non-UTF-8 setting such as C.
It is not a new technique, but it's surprisingly hard to find in the documentation. As presented above, it actually respects non-UTF-8 settings such as ISO 8859-*. It only defaults to UTF-8 if Python would have bogusly defaulted to ASCII, breaking the application.
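For instance, under an ISO 8859-1 terminal the guard leaves that encoding in place, and errors='replace' degrades unrepresentable characters to ? instead of raising (again sketched with io.BytesIO in place of the real stream):

```python
import codecs
import io

buf = io.BytesIO()
writer = codecs.getwriter('iso-8859-1')(buf, errors='replace')
writer.write(u'caf\xe9 \N{snowman}\n')

# U+00E9 fits in Latin-1; the snowman does not and becomes '?'
assert buf.getvalue() == b'caf\xe9 ?\n'
```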
I don't think you should try to solve this at the Python level. Document your application's requirements, log the locale of the systems you run on so it can be included in bug reports, and leave it at that.
If you do want to go this route, at least distinguish between terminals and pipes: you should never output data to a terminal that the terminal cannot explicitly handle. Don't output UTF-8, for example, as the bytes of codepoints above U+007F could end up being interpreted as control codes by the terminal.
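To make the control-code hazard concrete: the UTF-8 encoding of many codepoints contains bytes in the C1 control range (0x80-0x9F), which a terminal expecting Latin-1 may act on as control codes:

```python
snowman = u'\N{snowman}'.encode('utf-8')
assert snowman == b'\xe2\x98\x83'

# 0x98 lies in the C1 control range 0x80-0x9F;
# a Latin-1 terminal could interpret it as a control code
assert any(0x80 <= byte <= 0x9F for byte in snowman)
```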
For a pipe, output UTF-8 by default and make it configurable.
So you'd detect if a TTY is being used, then handle encoding based on that. For a terminal, set an error handler (pick one of replace or backslashreplace to provide replacement characters or escape sequences for whatever characters cannot be handled); for a pipe, use a configurable codec.
import codecs
import os
import sys

if os.isatty(sys.stdout.fileno()):
    output_encoding = sys.stdout.encoding
    errors = 'replace'
else:
    output_encoding = 'utf-8'  # allow override from settings
    errors = 'strict'  # perhaps parse from settings, not needed for UTF-8

# On Python 3, wrap the underlying binary buffer; on Python 2, sys.stdout itself
stream = sys.stdout.buffer if sys.version_info.major >= 3 else sys.stdout
sys.stdout = codecs.getwriter(output_encoding)(stream, errors=errors)
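The difference between the two error handlers suggested above, shown in isolation:

```python
s = u'caf\xea \N{snowman}'

# 'replace' substitutes '?' for anything ASCII cannot represent;
# 'backslashreplace' emits Python-style escape sequences instead
assert s.encode('ascii', 'replace') == b'caf? ?'
assert s.encode('ascii', 'backslashreplace') == b'caf\\xea \\u2603'
```

backslashreplace loses no information, which makes it the safer pick when the output may be inspected later; replace is the less noisy choice for human-facing terminals.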