
Why do I have to do `sys.stdin = codecs.getreader(sys.stdin.encoding)(sys.stdin)`?

I'm writing a Python program that upper-cases all input (a replacement for the non-working tr '[:lower:]' '[:upper:]'). The locale is ru_RU.UTF-8, and I use PYTHONIOENCODING=UTF-8 to set the stdin/stdout encodings. This correctly sets sys.stdin.encoding. So why do I still need to explicitly create a decoding wrapper if sys.stdin already knows the encoding? If I don't create the wrapping reader, .upper() doesn't work correctly (it does nothing for non-ASCII characters).

import sys, codecs
sys.stdin = codecs.getreader(sys.stdin.encoding)(sys.stdin) #Why do I need this?
for line in sys.stdin:
    sys.stdout.write(line.upper())

Why does stdin have .encoding if it doesn't use it?

Ark-kun asked Apr 03 '13 02:04


1 Answer

To answer "why", we need to understand Python 2.x's built-in file type, file.encoding, and their relationship.

The built-in file object deals with raw bytes: it always reads and writes raw bytes.

The encoding attribute describes the encoding of the raw bytes in the stream. This attribute may or may not be present, and it may not even be reliable (e.g. if PYTHONIOENCODING is set incorrectly for the standard streams).

The only time a file object performs any automatic conversion is when a unicode object is written to the stream. In that case it uses file.encoding, if available, to perform the conversion.

In the case of reading data, the file object will not do any conversion because it returns raw bytes. The encoding attribute in this case is a hint for the user to perform conversions manually.
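A minimal sketch of why .upper() appears to do nothing on undecoded input. It is written for Python 3, where bytes plays roughly the role of Python 2's str; the sample word and the variable names are illustrative assumptions, not part of the original question.

```python
# Python 3 sketch: bytes stands in for Python 2's str (raw bytes).
raw = "привет".encode("utf-8")   # what a byte-oriented stream would return

# bytes.upper() only maps ASCII letters, so the Cyrillic bytes stay unchanged.
upper_raw = raw.upper()

# Decoding first (with the encoding the stream advertises) produces real text,
# and str.upper() then applies Unicode case mapping.
upper_text = raw.decode("utf-8").upper()

print(upper_raw == raw)          # True: nothing was upper-cased
print(upper_text == "ПРИВЕТ")    # True: the decoded text was upper-cased
```

The manual decode step is exactly the conversion that the encoding attribute hints at.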

In your case, sys.stdin.encoding is set because you set the PYTHONIOENCODING variable. To get a text stream, we have to wrap sys.stdin manually, as you have done in your example code.
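The wrapping step can be sketched without a terminal by substituting an in-memory byte stream for sys.stdin. This is an illustration under assumptions: the io.BytesIO stand-in, the sample text, and the hard-coded encoding are not from the question, where the encoding would come from sys.stdin.encoding.

```python
import codecs
import io

# Stand-in for a byte-oriented stdin; the sample text is an assumption.
byte_stream = io.BytesIO("привет\nмир\n".encode("utf-8"))
encoding = "utf-8"  # in the question this would be sys.stdin.encoding

# codecs.getreader returns a StreamReader class for the codec;
# calling that class wraps the byte stream in a decoding reader.
reader = codecs.getreader(encoding)(byte_stream)

# Iterating the reader yields decoded text lines, so .upper() handles Cyrillic.
lines = [line.upper() for line in reader]

print(lines == ["ПРИВЕТ\n", "МИР\n"])  # True
```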

To think about it another way, imagine that we didn't have a separate text type (like Python 2.x's unicode or Python 3's str). We could still work with text as raw bytes, as long as we kept track of the encoding used. That is roughly what file.encoding is meant for: tracking the encoding. The reader wrappers that we create do the tracking and the conversions for us automatically.

Of course, automatically wrapping sys.stdin would be nicer (and that is what Python 3.x does), but changing the default behaviour of sys.stdin in Python 2.x would break backwards compatibility.
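For comparison, here is a rough sketch of how the question's program might look on Python 3, where no explicit wrapper is needed. The helper name upper_lines and the in-memory streams are hypothetical and used only so the example runs without a terminal.

```python
import io

def upper_lines(src, dst):
    """Copy text lines from src to dst, upper-casing each one."""
    for line in src:
        dst.write(line.upper())

# In-memory demonstration; io.StringIO stands in for the text streams.
demo_out = io.StringIO()
upper_lines(io.StringIO("привет, мир\n"), demo_out)

print(demo_out.getvalue() == "ПРИВЕТ, МИР\n")  # True

# On Python 3 the real program would call: upper_lines(sys.stdin, sys.stdout)
# because sys.stdin already yields decoded str lines there.
```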

The following is a comparison of sys.stdin in Python 2.x and 3.x:

# Python 2.7.4
>>> import sys
>>> type(sys.stdin)
<type 'file'>
>>> sys.stdin.encoding
'UTF-8'
>>> w = sys.stdin.readline()
## ... type stuff - enter
>>> type(w)
<type 'str'>           # In Python 2.x str is just raw bytes
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')

The io.TextIOWrapper class has been part of the standard library since Python 2.6. This class has an encoding attribute that is used to convert raw bytes to and from Unicode.

# Python 3.3.1
>>> import sys
>>> type(sys.stdin)
<class '_io.TextIOWrapper'>
>>> sys.stdin.encoding
'UTF-8'
>>> w = sys.stdin.readline()
## ... type stuff - enter
>>> type(w)
<class 'str'>        # In Python 3.x str is Unicode
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')

The buffer attribute provides access to the raw byte stream backing stdin; this is usually a BufferedReader. Note below that it does not have an encoding attribute.

# Python 3.3.1 again
>>> type(sys.stdin.buffer)
<class '_io.BufferedReader'>
>>> w = sys.stdin.buffer.readline()
## ... type stuff - enter
>>> type(w)
<class 'bytes'>      # bytes is (kind of) equivalent to Python 2 str
>>> sys.stdin.buffer.encoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: '_io.BufferedReader' object has no attribute 'encoding'

In Python 3 the presence or absence of the encoding attribute is consistent with the type of stream used.
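The same layering can be reproduced by hand, which may make the split between text layer and byte layer easier to see. This is only a sketch: the io.BytesIO object stands in for the byte stream behind stdin, and the sample text is an assumption.

```python
import io

# Byte layer: holds raw bytes and has no notion of an encoding.
raw = io.BytesIO("привет\n".encode("utf-8"))

# Text layer: io.TextIOWrapper decodes bytes from the layer below it.
text = io.TextIOWrapper(raw, encoding="utf-8")

line = text.readline()

print(type(line).__name__)        # str: the text layer returns decoded text
print(text.encoding)              # utf-8
print(hasattr(raw, "encoding"))   # False: the byte layer has no encoding
print(text.buffer is raw)         # True: .buffer exposes the wrapped byte stream
```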

finiteint answered Oct 30 '22 22:10