I'm writing a Python program that upper-cases all input (a replacement for the non-working tr '[:lower:]' '[:upper:]'). The locale is ru_RU.UTF-8, and I use PYTHONIOENCODING=UTF-8 to set the STDIN/STDOUT encodings. This correctly sets sys.stdin.encoding. So why do I still need to explicitly create a decoding wrapper if sys.stdin already knows the encoding? If I don't create the wrapping reader, the .upper() method doesn't work correctly (it does nothing for non-ASCII characters).
import sys, codecs
sys.stdin = codecs.getreader(sys.stdin.encoding)(sys.stdin)  # Why do I need this?
for line in sys.stdin:
    sys.stdout.write(line.upper())
Why does stdin have .encoding if it doesn't use it?
To answer "why", we need to understand Python 2.x's built-in file type, file.encoding, and their relationship.
The built-in file object deals with raw bytes: it always reads and writes raw bytes.
The encoding attribute describes the encoding of the raw bytes in the stream. This attribute may or may not be present, and it may not even be reliable (for example, if PYTHONIOENCODING is set incorrectly for the standard streams).
The only time file objects perform any automatic conversion is when a unicode object is written to the stream. In that case, the file uses file.encoding, if available, to perform the conversion.
When reading data, the file object does not do any conversion, because it returns raw bytes. In this case the encoding attribute is a hint for the user to perform the conversion manually.
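The effect described in the question can be reproduced without any stream at all. The following is a minimal sketch written to run on Python 3, where bytes plays the role of Python 2's str; the sample word and the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch: why upper-casing undecoded bytes leaves Cyrillic untouched.
text = u"привет"                # a real text (Unicode) object
raw = text.encode("utf-8")      # the same data as raw UTF-8 bytes

# Byte-wise upper-casing only touches ASCII letters a-z, and every byte
# of UTF-8 encoded Cyrillic lies outside the ASCII range.
raw_upper = raw.upper()

# Decoding first gives a text object that knows about Cyrillic letters.
text_upper = raw.decode("utf-8").upper()

print(raw_upper == raw)           # the bytes come back unchanged
print(text_upper == u"ПРИВЕТ")    # the decoded text is upper-cased
```

This is the same situation as calling .upper() on the lines read from an unwrapped sys.stdin in Python 2: the data is there, but nothing has decoded it yet.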
file.encoding is set in your case because you set the PYTHONIOENCODING variable, and sys.stdin's encoding attribute was set accordingly. To get a text stream, we have to wrap it manually, as you have done in your example code.
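To see the manual wrapping in isolation, here is a small sketch that uses an in-memory byte stream as a stand-in for sys.stdin. The names byte_stream and declared_encoding are hypothetical; declared_encoding stands for whatever sys.stdin.encoding would report.

```python
# -*- coding: utf-8 -*-
import codecs
import io

# Stand-in for sys.stdin: a stream that yields raw UTF-8 bytes.
byte_stream = io.BytesIO(u"привет\nмир\n".encode("utf-8"))

# Assumption: this is the value sys.stdin.encoding would report.
declared_encoding = "UTF-8"

# codecs.getreader() returns a StreamReader class for the encoding;
# instantiating it around the byte stream gives a decoding reader.
reader = codecs.getreader(declared_encoding)(byte_stream)

# Each line is now decoded text, so .upper() handles Cyrillic.
upper_lines = [line.upper() for line in reader]

for line in upper_lines:
    print(repr(line))
```

The byte stream itself never decodes anything; the wrapper is the piece that actually uses the encoding.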
To think about it another way, imagine that we didn't have a separate text type (like Python 2.x's unicode or Python 3's str). We could still work with text by using raw bytes and keeping track of the encoding used. This is roughly how file.encoding is meant to be used: to track the encoding. The reader wrappers that we create do the tracking and conversion for us automatically.
Of course, automatically wrapping sys.stdin would be nicer (and that is what Python 3.x does), but changing the default behaviour of sys.stdin in Python 2.x would break backwards compatibility.
The following is a comparison of sys.stdin in Python 2.x and 3.x:
# Python 2.7.4
>>> import sys
>>> type(sys.stdin)
<type 'file'>
>>> sys.stdin.encoding
'UTF-8'
>>> w = sys.stdin.readline()
## ... type stuff - enter
>>> type(w)
<type 'str'> # In Python 2.x str is just raw bytes
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
The io.TextIOWrapper class has been part of the standard library since Python 2.6. This class has an encoding attribute that is used to convert raw bytes to and from Unicode.
# Python 3.3.1
>>> import sys
>>> type(sys.stdin)
<class '_io.TextIOWrapper'>
>>> sys.stdin.encoding
'UTF-8'
>>> w = sys.stdin.readline()
## ... type stuff - enter
>>> type(w)
<class 'str'> # In Python 3.x str is Unicode
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
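The same kind of text stream can be built by hand with io.TextIOWrapper. The sketch below wraps an in-memory byte stream rather than the real standard input, and the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
import io

# Stand-in for the raw byte stream behind standard input.
byte_stream = io.BytesIO(u"привет\n".encode("utf-8"))

# TextIOWrapper decodes on read, using the encoding it was given.
text_stream = io.TextIOWrapper(byte_stream, encoding="utf-8")

line = text_stream.readline()   # already decoded text, not bytes
upper_line = line.upper()

print(repr(upper_line))
```

Here the encoding is not just a hint: the wrapper applies it on every read, which is why upper-casing works without any extra step.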
The buffer attribute provides access to the raw byte stream backing stdin; this is usually a BufferedReader. Note below that it does not have an encoding attribute.
# Python 3.3.1 again
>>> type(sys.stdin.buffer)
<class '_io.BufferedReader'>
>>> w = sys.stdin.buffer.readline()
## ... type stuff - enter
>>> type(w)
<class 'bytes'> # bytes is (kind of) equivalent to Python 2 str
>>> sys.stdin.buffer.encoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: '_io.BufferedReader' object has no attribute 'encoding'
In Python 3 the presence or absence of the encoding attribute is consistent with the type of stream used.
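That consistency can be checked without touching the real standard streams. This sketch, intended for Python 3, builds a byte stream and a text stream over in-memory data; all names in it are made up for the illustration.

```python
import io

# A buffered byte stream, like sys.stdin.buffer in Python 3.
byte_stream = io.BufferedReader(io.BytesIO(b"abc\n"))

# A text stream layered over another byte stream, like sys.stdin itself.
text_stream = io.TextIOWrapper(
    io.BufferedReader(io.BytesIO(b"abc\n")), encoding="utf-8"
)

# Only the text layer carries an encoding attribute.
byte_has_encoding = hasattr(byte_stream, "encoding")
text_has_encoding = hasattr(text_stream, "encoding")

# The type of data returned matches the layer it was read from.
byte_line = byte_stream.readline()
text_line = text_stream.readline()

print(byte_has_encoding, type(byte_line).__name__)
print(text_has_encoding, type(text_line).__name__)
```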