I recently ran into problems decoding a handle from the Biopython module (errors about bytes 0x81 and 0x8D that could not be mapped) with an Anaconda 4.1.1 / Python 3.5.2 installation on a Sony Vaio running Windows 10.
After some research, it seems the problem may be that the default decoding codec is cp1252. I ran the code below and found that the default codec is indeed set to cp1252.
However, several posts suggest that Python 3 should have set the default codec to UTF-8. Is that correct? If so, why is mine cp1252, and how can I solve this?
import locale
os_encoding = locale.getpreferredencoding()
print(os_encoding)  # prints cp1252 on the system described above
In Python 3, strings live in the world of Unicode code points. We only need to convert to and from bytes when writing or reading data. For str.encode() and bytes.decode() the default encoding is UTF-8, but other encodings can also be used; file I/O through open() uses a platform-dependent default unless an encoding is passed explicitly.
Windows-1252 or CP-1252 (code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German.
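The 0x81 and 0x8D errors from the question fit this picture: Python's cp1252 codec leaves a handful of byte values unmapped, so UTF-8 data that happens to contain them cannot be decoded as cp1252. A minimal sketch (the helper name undefined_cp1252_bytes is made up for illustration):

```python
def undefined_cp1252_bytes():
    """Return the byte values that Python's cp1252 codec refuses to decode."""
    undefined = []
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            undefined.append(value)
    return undefined


print([hex(value) for value in undefined_cp1252_bytes()])

# U+00C1 (capital A with acute) is the two bytes C3 81 in UTF-8.
# The second byte, 0x81, has no mapping in cp1252.
utf8_data = "\u00c1".encode("utf-8")
try:
    utf8_data.decode("cp1252")
except UnicodeDecodeError as exc:
    print("cp1252 failed:", exc)

# Decoding with the encoding the data was written in works.
print(ascii(utf8_data.decode("utf-8")))
```

So a file that is really UTF-8 but gets read with the Windows default can fail on exactly these bytes.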
String encoding: converting a Unicode string to bytes is known as encoding. Various encodings exist, and each represents the same string as different bytes; popular ones include utf-8 and ascii. Using the string encode() method, you can convert Unicode strings into any encoding supported by Python. By default, encode() uses utf-8.
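As a small illustration of how the same string turns into different bytes depending on the codec (the sample word is arbitrary and written with an escape so the snippet does not depend on the source file's encoding):

```python
text = "caf\u00e9"  # "cafe" with an acute accent on the e

utf8_bytes = text.encode("utf-8")      # the accented letter takes two bytes
cp1252_bytes = text.encode("cp1252")   # the accented letter takes one byte

# ascii cannot represent the accented letter; errors="replace" substitutes "?"
ascii_lossy = text.encode("ascii", errors="replace")

print(utf8_bytes)
print(cp1252_bytes)
print(ascii_lossy)
```

Without errors="replace", the ascii call would raise UnicodeEncodeError instead of substituting a character.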
According to What’s New In Python 3.0,
There is a platform-dependent default encoding […] In many cases, but not all, the system default is UTF-8; you should never count on this default.
and
PEP 3120: The default source encoding is now UTF-8.
In other words, Python reads source files as UTF-8 by default, but any interaction with the filesystem will depend on the environment. It's strongly recommended to use open(filename, encoding='utf-8') to read a file.
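Here is a rough sketch of why the explicit encoding matters. The helper write_then_read is hypothetical; it writes a string to a temporary file with one encoding and reads it back with another:

```python
import os
import tempfile


def write_then_read(text, write_encoding, read_encoding):
    """Write text to a temporary file, then read it back with another codec."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=write_encoding) as handle:
            handle.write(text)
        with open(path, "r", encoding=read_encoding) as handle:
            return handle.read()
    finally:
        os.remove(path)


sample = "caf\u00e9"

# Same encoding on both sides: the text survives the round trip.
print(ascii(write_then_read(sample, "utf-8", "utf-8")))

# Written as UTF-8 but read as cp1252: no error here, just silent mojibake.
print(ascii(write_then_read(sample, "utf-8", "cp1252")))
```

The second call is what happens implicitly when a UTF-8 file is opened without encoding= on a system whose default is cp1252.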
Another change is that b'bytes'.decode() and 'str'.encode() with no argument use utf-8 instead of ascii.
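A quick check of that default, independent of the platform locale (the sample string is arbitrary):

```python
import sys

text = "na\u00efve"

encoded = text.encode()     # no argument: UTF-8
decoded = encoded.decode()  # no argument: UTF-8

print(encoded == text.encode("utf-8"))
print(decoded == text)
print(sys.getdefaultencoding())
```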
Python 3.6 changes some more defaults:
PEP 529: Change Windows filesystem encoding to UTF-8
PEP 528: Change Windows console encoding to UTF-8
But the default encoding for open() is still whatever Python manages to infer from the environment.
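To see what Python inferred on a particular machine, something along these lines can help. It is only a sketch: report_default_encodings is a made-up helper, and the values it returns differ between systems and Python versions.

```python
import locale
import os
import sys
import tempfile


def report_default_encodings():
    """Collect the encodings Python uses when none is given explicitly."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # No encoding= on purpose, to see what open() picks by itself.
        with open(path, "w") as handle:
            open_default = handle.encoding
    finally:
        os.remove(path)
    return {
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
        "open() default": open_default,
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        "sys.getfilesystemencoding": sys.getfilesystemencoding(),
    }


for name, value in report_default_encodings().items():
    print(name, "->", value)
```

On the Windows setup from the question, the first two entries would be expected to show cp1252, while sys.getdefaultencoding() reports utf-8 on any Python 3.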
It appears that 3.7 will add an (opt-in!) mode where the environmental locale encoding is ignored, and everything is all UTF-8 all the time (except for specific cases where Windows uses UTF-16, I suppose). See PEP 0540 and corresponding Issue 29240.
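Assuming an interpreter where PEP 540 has landed (Python 3.7 or later), the opt-in mode can be tried with the -X utf8 command-line option or the PYTHONUTF8=1 environment variable. A minimal sketch that starts a child interpreter in that mode and asks what encoding it prefers (the function name is invented for this example):

```python
import codecs
import subprocess
import sys


def preferred_encoding_in_utf8_mode():
    """Start a child interpreter with -X utf8 and return its preferred encoding."""
    result = subprocess.run(
        [
            sys.executable, "-X", "utf8", "-c",
            "import locale; print(locale.getpreferredencoding(False))",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    # Normalize spelling differences such as "UTF-8" versus "utf-8".
    return codecs.lookup(result.stdout.strip()).name


print("UTF-8 mode in this process:", sys.flags.utf8_mode)
print("Preferred encoding in the child:", preferred_encoding_in_utf8_mode())
```

In that mode open() without encoding= should also default to UTF-8, which would sidestep the cp1252 default described in the question.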