
Python standard idiom to set sys.stdout buffer to zero doesn't work with Unicode

When I'm writing sysadmin scripts in Python, the buffer on sys.stdout that affects every call to print() is annoying. I don't want to wait for the buffer to be flushed and then get a big chunk of lines on the screen at once; I want each individual line of output as soon as the script generates it. I don't even want to wait for a newline to see the output.

A commonly used idiom to do this in Python is

import os
import sys
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)

This worked fine for me for a long time. I have now noticed that it doesn't work with Unicode. Please see the following script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import print_function, unicode_literals

import os
import sys

print('Original encoding: {}'.format(sys.stdout.encoding))
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)
print('New encoding: {}'.format(sys.stdout.encoding))

text = b'Eisb\xe4r'
print(type(text))
print(text)

text = text.decode('latin-1')
print(type(text))
print(text)

This leads to the following output:

Original encoding: UTF-8
New encoding: None
<type 'str'>
Eisb▒r
<type 'unicode'>
Traceback (most recent call last):
  File "./export_debug.py", line 18, in <module>
    print(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 4: ordinal not in range(128)

It took me hours to track down the reason for it (my original script was much longer than this minimal debugging script). It is the line

sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)

which I had used for years, so I didn't expect any problem with it. Comment out this line and the output is correct:

Original encoding: UTF-8
New encoding: UTF-8
<type 'str'>
Eisb▒r
<type 'unicode'>
Eisbär

So what is the script meant to do? To keep my Python 2.7 code as close as possible to Python 3.x, I always use

from __future__ import print_function, unicode_literals

which makes Python use the new print() function and, more importantly, turns every string literal in the module into a unicode object by default. I have a lot of Latin-1 / ISO-8859-1 encoded data, for example

text = b'Eisb\xe4r'

To work with it the intended way, I need to decode it to Unicode first, that's what

text = text.decode('latin-1')

is for. As the default encoding on my system is UTF-8, Python encodes the internal Unicode string to UTF-8 whenever I print it. But for that to work, the string has to be proper Unicode internally.
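The decode-then-encode round trip described above can be sketched in a few lines. This is only an illustration, assuming Latin-1 input and a UTF-8 terminal; the variable names raw and utf8_bytes are made up for the example.

```python
# -*- coding: utf-8 -*-
# Latin-1 bytes as they arrive from the legacy data source.
raw = b'Eisb\xe4r'

# Step 1: decode to a Unicode string (6 characters).
text = raw.decode('latin-1')

# Step 2: what printing to a UTF-8 stream does implicitly:
# the single character U+00E4 becomes the two bytes C3 A4.
utf8_bytes = text.encode('utf-8')
```

The Latin-1 byte E4 and the UTF-8 byte pair C3 A4 both represent the same character; only the decoded string in the middle is independent of any encoding.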

All of that works fine in general, just not with an unbuffered stdout. Any ideas? I noticed that sys.stdout.encoding is unset (None) after the zero-buffering line, but I don't know how to set it again. It is a read-only attribute, and the OS environment variables LC_ALL and LC_CTYPE seem to be evaluated only when the Python interpreter starts.

Btw.: 'Eisbär' is the German word for 'polar bear'.

asked Oct 10 '12 17:10 by Marten Lehmann



1 Answer

The print function uses a special flag when writing to a file object, which causes the PyFile_WriteObject function of the Python C API to look up the stream's output encoding for the unicode-to-bytes conversion. By replacing the stdout stream you lose that encoding. Unfortunately, you cannot explicitly set it again:

encoding = sys.stdout.encoding
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)
sys.stdout.encoding = encoding  # Raises a TypeError; readonly attribute
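Why the missing encoding ends in an ASCII error can be modelled roughly as follows. The helper encode_for_stream is hypothetical and is not CPython's actual code; it assumes that a stream whose encoding is None falls back to the interpreter default, which is ASCII on Python 2.

```python
def encode_for_stream(text, stream_encoding, default='ascii'):
    """Hypothetical model: encode text as a stream with this encoding would."""
    # Assumption: a stream without an encoding uses the default codec.
    codec = stream_encoding or default
    return text.encode(codec)


text = b'Eisb\xe4r'.decode('latin-1')

# The original stdout reported UTF-8, so the umlaut encodes cleanly.
with_encoding = encode_for_stream(text, 'UTF-8')

# The replaced stdout reports None, so the ASCII fallback rejects U+00E4.
try:
    encode_for_stream(text, None)
    failed = False
except UnicodeEncodeError:
    failed = True
```

This mirrors the two outputs in the question: with an encoding present the text prints, without one the same call raises UnicodeEncodeError.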

You also cannot use the io.open function instead: it only allows buffering to be disabled (buffering=0) in binary mode, and binary mode does not accept the encoding option you would need.
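The io.open restriction is easy to check. The sketch below uses a pipe as a stand-in for stdout so that it has no side effects on the terminal; try_unbuffered_text is a hypothetical helper name, and closefd=False is passed so the descriptor stays open whether or not the call fails.

```python
import io
import os


def try_unbuffered_text(fd, encoding='utf-8'):
    """Return io.open's error message for unbuffered text mode, or None."""
    try:
        stream = io.open(fd, 'w', buffering=0, encoding=encoding,
                         closefd=False)
    except ValueError as exc:
        return str(exc)
    stream.close()
    return None


# A pipe stands in for stdout; any writable file descriptor would do.
read_fd, write_fd = os.pipe()
message = try_unbuffered_text(write_fd)
os.close(read_fd)
os.close(write_fd)
```

Here message holds the ValueError text, which confirms that text mode (the only mode that takes an encoding) cannot be combined with buffering=0.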

The proper way to have the print function flush immediately is to use the flush=True keyword (available from Python 3.3 onwards):

print(something, flush=True)

If that's too tedious to add everywhere, consider using a custom print function:

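from __future__ import print_function
import sys
import __builtin__  # Python 2 name; the module is called builtins in Python 3

def print(*args, **kw):
    flush = kw.pop('flush', True)  # Python 2.7's print() has no flush keyword
    __builtin__.print(*args, **kw)
    if flush:
        # Flush the stream that was actually written to (stdout by default)
        (kw.get('file') or sys.stdout).flush()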

Since Python 2.7's print() function doesn't support the flush keyword (botheration), the custom version simulates it by removing the keyword from the arguments and calling flush() explicitly.
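To check that such a wrapper really flushes, it can be pointed at a stream that counts flush() calls. This is a sketch under assumptions: CountingStream and flushing_print are hypothetical names, the wrapper is given its own name here instead of shadowing print, and the try/except import is there so the same code runs on Python 3 as well.

```python
from __future__ import print_function

import io
import sys

try:
    import builtins  # Python 3
except ImportError:
    import __builtin__ as builtins  # Python 2


class CountingStream(io.StringIO):
    """Hypothetical text stream that counts how often flush() is called."""

    def __init__(self):
        super(CountingStream, self).__init__()
        self.flush_calls = 0

    def flush(self):
        self.flush_calls += 1
        super(CountingStream, self).flush()


def flushing_print(*args, **kw):
    """Print, then flush the target stream unless flush=False is given."""
    flush = kw.pop('flush', True)
    target = kw.get('file')
    if target is None:
        target = sys.stdout
    builtins.print(*args, **kw)
    if flush:
        target.flush()


stream = CountingStream()
flushing_print(u'Eisb\xe4r', file=stream)                # flushed
flushing_print(u'polar bear', file=stream, flush=False)  # not flushed
```

Both lines end up in the stream, but only the first call triggers a flush, so callers can still opt out for bulk output.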

answered Oct 15 '22 10:10 by Martijn Pieters