Here are two Python 3 code samples. The first one writes two files with Latin-1 encoding:
s='On écrit ça dans un fichier.'
with open('spam1.txt', 'w', encoding='ISO-8859-1') as f:
    print(s, file=f)
with open('spam2.txt', 'w', encoding='ISO-8859-1') as f:
    f.write(s)
The second one reads the same files with the same encoding:
with open('spam1.txt', 'r', encoding='ISO-8859-1') as f:
    s1 = f.read()
with open('spam2.txt', 'r', encoding='ISO-8859-1') as f:
    s2 = f.read()
Now, printing s1 and s2, I get
On Ã©crit Ã§a dans un fichier.
instead of the initial "On écrit ça dans un fichier."
What is wrong? I also tried io.open, but I am still missing something. The funny part is that I had no such problem with Python 2.7 and its str.decode method, which is now gone...
Could someone help me?
The latin-1 encoding in Python implements ISO_8859-1:1987 which maps all possible byte values to the first 256 Unicode code points, and thus ensures decoding errors will never occur regardless of the configured error handler.
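Latin-1 is often suggested as a way to get rid of a UnicodeDecodeError when reading a file in Python or pandas, because every byte decodes to some character. That only hides the problem if the data is really in another encoding such as UTF-8: the read succeeds, but the text comes out garbled. Latin-1 is a single-byte encoding covering byte values 0 through 255; ASCII covers only 0 through 127, so it can encode half as many characters as latin1.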
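As a quick check of the claim that Latin-1 decoding cannot fail, the following minimal sketch decodes every possible byte value and contrasts that with ASCII. The variable names are only illustrative.

```python
# Every byte value 0..255 decodes under Latin-1, one byte per code point.
all_bytes = bytes(range(256))
text = all_bytes.decode('latin-1')

# Byte n becomes code point n, so the mapping is the identity on 0..255.
code_points = [ord(c) for c in text]

# Encoding back gives the original bytes: the round trip is lossless.
round_trip = text.encode('latin-1')

# ASCII only covers 0..127, so a byte such as 0xE9 is rejected.
try:
    bytes([0xE9]).decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(code_points == list(range(256)), round_trip == all_bytes, ascii_failed)
```

Because the decode never raises, Latin-1 will also "succeed" on bytes that are really UTF-8, which is how garbled text can slip through unnoticed.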
Your data was written out as UTF-8:
>>> 'On écrit ça dans un fichier.'.encode('utf8').decode('latin1')
'On Ã©crit Ã§a dans un fichier.'
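If the garbled string is already in hand, the same mis-decoding can be reversed. This is a minimal sketch, assuming the damage is exactly "UTF-8 bytes decoded as Latin-1" and nothing else; the literals use escape codes so the example does not depend on how this snippet itself is saved.

```python
# The intended text, written with escapes: \xe9 is e-acute, \xe7 is c-cedilla.
original = 'On \xe9crit \xe7a dans un fichier.'

# Reproduce the problem: UTF-8 bytes read back as if they were Latin-1.
garbled = original.encode('utf-8').decode('latin-1')

# Undo it: Latin-1 maps each character back to the original byte,
# and those bytes are then decoded with the encoding they really use.
repaired = garbled.encode('latin-1').decode('utf-8')

print(ascii(garbled))
print(repaired == original)
```

The repair only works because Latin-1 is a lossless byte-to-character mapping; it is a workaround, and fixing the encoding mismatch at the source is still the better option.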
This either means you did not write out Latin-1 data, or your source code was saved as UTF-8 but you declared your script (using a PEP 263-compliant header) to be Latin-1 instead.
If you saved your Python script with a header like:
# -*- coding: latin-1 -*-
but your text editor saved the file with UTF-8 encoding instead, then the string literal:
s='On écrit ça dans un fichier.'
will be misinterpreted by Python as well, in the same manner. Saving the resulting unicode value to disk as Latin-1, then reading it again as Latin-1 will preserve the error.
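The scenario can be simulated without touching any editor settings. The sketch below is an assumption-laden illustration: it fakes the mismatch by decoding UTF-8 bytes as Latin-1 by hand, and it writes to a temporary directory rather than the current one.

```python
import os
import tempfile

# Bytes the editor actually saved for the literal (UTF-8).
source_bytes = 'On \xe9crit \xe7a dans un fichier.'.encode('utf-8')

# What Python sees if the header claims the source is Latin-1.
s = source_bytes.decode('latin-1')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'spam1.txt')

    # Write and read back with Latin-1, as in the question.
    with open(path, 'w', encoding='ISO-8859-1') as f:
        f.write(s)
    with open(path, 'r', encoding='ISO-8859-1') as f:
        s1 = f.read()

    # Look at the raw bytes that ended up on disk.
    with open(path, 'rb') as f:
        raw = f.read()

print(s1 == s)              # the file round trip itself is faithful
print(raw == source_bytes)  # but the file holds UTF-8 bytes, not Latin-1 text
```

So the file I/O is not at fault: the string was already wrong before it was written, and the Latin-1 round trip faithfully preserved it.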
To debug, please take a close look at print(s.encode('unicode_escape'))
in the first script. If it looks like:
b'On \\xc3\\xa9crit \\xc3\\xa7a dans un fichier.'
then the actual encoding of your source file and the PEP 263 header disagree on how the source code should be interpreted. If your source code is decoded correctly, the output is:
b'On \\xe9crit \\xe7a dans un fichier.'
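The same check can be automated. The helper below, `looks_double_decoded`, is a hypothetical name introduced here, and it is only a heuristic: it reports True when a string can be turned into Latin-1 bytes that also form valid, different UTF-8 text.

```python
def looks_double_decoded(s):
    """Heuristic: does s look like UTF-8 bytes that were decoded as Latin-1?"""
    try:
        candidate = s.encode('latin-1').decode('utf-8')
    except UnicodeError:
        # Not encodable as Latin-1, or the bytes are not valid UTF-8.
        return False
    # Pure ASCII text survives unchanged, so it is not evidence of a problem.
    return candidate != s


good = 'On \xe9crit \xe7a dans un fichier.'
bad = good.encode('utf-8').decode('latin-1')

print(good.encode('unicode_escape'))
print(bad.encode('unicode_escape'))
print(looks_double_decoded(good), looks_double_decoded(bad))
```

A correctly decoded string fails the UTF-8 step because a lone byte such as 0xE9 followed by an ASCII letter is not valid UTF-8, which is what makes the heuristic usable for short Latin-script text.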
If Spyder is stubbornly ignoring the PEP-263 header and reading your source as Latin-1 regardless, avoid using non-ASCII characters and use escape codes instead; either using \uxxxx
unicode code points:
s = 'On \u00e9crit \u00e7a dans un fichier.'
or \xaa
one-byte escape codes for code-points below 256:
s = 'On \xe9crit \xe7a dans un fichier.'
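To confirm that escape codes name the intended characters, they can be looked up with the standard unicodedata module. A small sketch, using U+00E9 for é and U+00E7 for ç:

```python
import unicodedata

# Both escape styles denote the same code points, so the strings are equal.
s_u = 'On \u00e9crit \u00e7a dans un fichier.'
s_x = 'On \xe9crit \xe7a dans un fichier.'

# Index 3 is the accented e, index 9 is the c with cedilla.
name_e = unicodedata.name(s_u[3])
name_c = unicodedata.name(s_u[9])

print(s_u == s_x)
print(name_e)
print(name_c)
```

Since the source file then contains only ASCII, the result no longer depends on which encoding the editor or the header claims.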