While porting code from Python 2 to Python 3, I run into this problem when reading UTF-8 text from standard input. In Python 2, this works fine:
for line in sys.stdin: ...
But Python 3 expects ASCII from sys.stdin, and if there are non-ASCII characters in the input, I get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte .. in position ..: ordinal not in range(128)
For a regular file, I would specify the encoding when opening the file:
with open('filename', 'r', encoding='utf-8') as file: for line in file: ...
But how can I specify the encoding for standard input? Other SO posts (e.g. How to change the stdin encoding on python) have suggested using
input_stream = codecs.getreader('utf-8')(sys.stdin) for line in input_stream: ...
However, this doesn't work in Python 3. I still get the same error message. I'm using Ubuntu 12.04.2 and my locale is set to en_US.UTF-8.
UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
String Encoding Since Python 3.0, strings are stored as Unicode, i.e. each character in the string is represented by a code point. So, each string is just a sequence of Unicode code points. For efficient storage of these strings, the sequence of code points is converted into a set of bytes.
The best way to attack the problem, as with many things in Python, is to be explicit. That means that every string that your code handles needs to be clearly treated as either Unicode or a byte sequence. The most systematic way to accomplish this is to make your code into a Unicode-only clean room.
Python 3 does not expect ASCII from sys.stdin
. It'll open stdin
in text mode and make an educated guess as to what encoding is used. That guess may come down to ASCII
, but that is not a given. See the sys.stdin
documentation on how the codec is selected.
Like other file objects opened in text mode, the sys.stdin
object derives from the io.TextIOBase
base class; it has a .buffer
attribute pointing to the underlying buffered IO instance (which in turn has a .raw
attribute).
Wrap the sys.stdin.buffer
attribute in a new io.TextIOWrapper()
instance to specify a different encoding:
import io import sys input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
Alternatively, set the PYTHONIOENCODING
environment variable to the desired codec when running python.
From Python 3.7 onwards, you can also reconfigure the existing std*
wrappers, provided you do it at the start (before any data has been read):
# Python 3.7 and newer sys.stdin.reconfigure(encoding='utf-8')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With