
How to write utf8 to standard output in a way that works with python2 and python3

I want to write a non-ASCII character, let's say →, to standard output. The tricky part seems to be that some of the data I want to concatenate to that string is read from JSON. Consider the following simple JSON document:

{"foo":"bar"}

I mention the JSON because if I just want to print →, it seems to be enough to write:

print("→")

and it will do the right thing in python2 and python3.

So I want to print the value of foo together with my non-ASCII character →. The only way I found to do this that works in both python2 and python3 is:

getattr(sys.stdout, 'buffer', sys.stdout).write(data["foo"].encode("utf8")+u"→".encode("utf8"))

or

getattr(sys.stdout, 'buffer', sys.stdout).write((data["foo"]+u"→").encode("utf8"))

It is important not to omit the u in front of "→", because otherwise python2 throws a UnicodeDecodeError.
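The error comes from an implicit conversion in Python 2: without the u prefix the literal is a byte string, and calling .encode("utf8") on a byte string first decodes it with the ASCII codec. A minimal sketch that performs the same decoding step explicitly, so it behaves the same on both versions (the variable names are illustrative only):

```python
# UTF-8 bytes of U+2192 (the arrow); this is what Python 2 stores for the
# literal when the u prefix is missing.
arrow_bytes = b"\xe2\x86\x92"

try:
    # Python 2 performs this ASCII decode implicitly before it can encode
    # a byte string; here it is spelled out.
    arrow_bytes.decode("ascii")
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

print(error_name)  # UnicodeDecodeError
```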

Using the print function like this:

print((data["foo"]+u"→").encode("utf8"), file=(getattr(sys.stdout, 'buffer', sys.stdout)))

doesn't seem to work, because python3 complains: TypeError: 'str' does not support the buffer interface.
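The likely reason is that the Python 3 print function converts its arguments to str before writing, while a binary stream such as sys.stdout.buffer only accepts bytes. A small sketch of that behaviour, using an in-memory io.BytesIO as a stand-in for sys.stdout.buffer (an assumption made so the example does not touch the real stdout; newer Python 3 releases word the error message differently):

```python
from __future__ import print_function

import io

# Stand-in for sys.stdout.buffer: a stream that accepts bytes.
binary_stream = io.BytesIO()

try:
    # On Python 3, print() turns the bytes object into its str
    # representation and then calls write() with that str.
    print(b"bar\xe2\x86\x92", file=binary_stream)
    outcome = "written"
except TypeError:
    outcome = "TypeError"
```

On Python 3, outcome ends up as "TypeError"; on Python 2 the write succeeds because str and bytes are the same type there.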

Did I find the best way or is there a better option? Can I make the print function work?

josch asked May 30 '14 00:05


2 Answers

The most concise version I could come up with is the following, which you may be able to shorten further with a few convenience functions (or even by replacing/overriding the print function):

# -*- coding=utf-8 -*-
import codecs
import os
import sys

# if you include the -*- coding line, you can use this
output = 'bar' + u'→'
# otherwise, use this
output = 'bar' + b'\xe2\x86\x92'.decode('utf-8')

if sys.stdout.encoding == 'UTF-8':
    print(output)
else:
    output += os.linesep
    if sys.version_info[0] >= 3:
        sys.stdout.buffer.write(output.encode('utf-8'))
    else:
        codecs.getwriter('utf-8')(sys.stdout).write(output)

The best option is to include the -*- coding -*- line, which allows you to use the actual character in the file. If for some reason you can't include that line, the same result is still achievable without it.
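The convenience function mentioned at the top of this answer could look roughly like the sketch below. It is not part of the tested code above: utf8_writer and fake_stdout are hypothetical names, and the demonstration writes to an in-memory byte stream rather than the real stdout.

```python
from __future__ import print_function

import codecs
import io


def utf8_writer(stream):
    """Wrap a stream so that text written to it is encoded as UTF-8.

    Python 3 text streams expose their underlying byte stream as .buffer;
    Python 2's sys.stdout accepts bytes directly, so it is used as-is.
    """
    binary = getattr(stream, "buffer", stream)
    return codecs.getwriter("utf-8")(binary)


# Demonstration on an in-memory byte stream instead of the real stdout.
fake_stdout = io.BytesIO()
print(u"bar" + u"\u2192", file=utf8_writer(fake_stdout))
written = fake_stdout.getvalue()  # UTF-8 bytes for "bar", the arrow, and a newline
```

With such a helper, the print function can be kept: print(text, file=utf8_writer(sys.stdout)) should emit UTF-8 regardless of what sys.stdout.encoding reports.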

This (both with and without the encoding line) works on Linux (Arch) with python 2.7.7 and 3.4.1. It also works if the terminal's encoding is not UTF-8. (On Arch Linux, I just change the encoding by using a different LANG environment variable.)
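The branch on sys.stdout.encoding depends on what the interpreter derives from its environment. Besides LANG, Python also honours the PYTHONIOENCODING environment variable, which is not used in this answer but can be handy for testing. A rough sketch (child_stdout_encoding is a hypothetical helper) that asks a child interpreter which encoding it reports:

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Return sys.stdout.encoding as reported by a child interpreter
    started with PYTHONIOENCODING set to io_encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return out.decode("ascii").strip()


reported = child_stdout_encoding("utf-8")
```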

LANG=zh_CN python test.py

It also sort of works on Windows, which I tried with 2.6, 2.7, 3.3, and 3.4. By sort of, I mean I could get the '→' character to display only on a mintty terminal. On a cmd terminal, that character would display as 'ΓåÆ'. (There may be something simple I'm missing there.)
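One plausible explanation, offered as an assumption rather than something verified on that machine: 'ΓåÆ' is what the three UTF-8 bytes of the arrow look like when a console interprets them using the legacy code page 437. The mapping can be reproduced like this:

```python
# UTF-8 encoding of U+2192 (the arrow): the bytes E2 86 92.
arrow_utf8 = u"\u2192".encode("utf-8")

# Interpreting those same bytes as code page 437 yields three characters:
# Greek capital gamma, a with ring above, and the AE ligature.
misread = arrow_utf8.decode("cp437")
```

If that is indeed the cause, the bytes written are correct and only the console's code page differs; switching cmd to the UTF-8 code page with chcp 65001 is the usual thing to try.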

snapshoe answered Oct 23 '22 15:10


If you don't need to print to sys.stdout.buffer, then the following should print fine to sys.stdout. I tried it in both Python 2.7 and 3.4, and it seemed to work fine:

# -*- coding=utf-8 -*-
print("bar" + u"→")
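One caveat: a plain print relies on sys.stdout.encoding being able to represent the character. The sketch below simulates a stdout whose encoding is ASCII by wrapping an in-memory stream (ascii_stdout is a hypothetical stand-in, not the real sys.stdout) and shows how the same print call fails there:

```python
from __future__ import print_function

import io

# Text stream that can only encode ASCII, standing in for a stdout whose
# encoding cannot represent the arrow.
ascii_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

try:
    print(u"bar" + u"\u2192", file=ascii_stdout)
    result = "printed"
except UnicodeEncodeError:
    result = "UnicodeEncodeError"
```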
Addison answered Oct 23 '22 15:10