
Writing and then reading a string in file encoded in latin1

Tags: python, io, latin1

Here are two Python 3 code samples. The first one writes two files with Latin-1 encoding:

s='On écrit ça dans un fichier.'
with open('spam1.txt', 'w',encoding='ISO-8859-1') as f:
    print(s, file=f)
with open('spam2.txt', 'w',encoding='ISO-8859-1') as f:
    f.write(s)

The second one reads the same files with the same encoding:

with open('spam1.txt', 'r',encoding='ISO-8859-1') as f:
    s1=f.read()
with open('spam2.txt', 'r',encoding='ISO-8859-1') as f:
    s2=f.read()

Now, printing s1 and s2, I get

On Ã©crit Ã§a dans un fichier.

instead of the initial "On écrit ça dans un fichier."

What is wrong? I also tried io.open, but I must be missing something. The funny part is that I had no such problem with Python 2.7 and its str.decode method, which is now gone...

Could someone help me?

François Coulombeau asked Jul 22 '13 14:07



1 Answer

Your data was written out as UTF-8:

>>> 'On écrit ça dans un fichier.'.encode('utf8').decode('latin1')
'On Ã©crit Ã§a dans un fichier.'

This means that either you did not write out Latin-1 data, or your source code was saved as UTF-8 but you declared your script (using a PEP 263-compliant header) to be Latin-1 instead.

If you saved your Python script with a header like:

# -*- coding: latin-1 -*-

but your text editor saved the file with UTF-8 encoding instead, then the string literal:

s='On écrit ça dans un fichier.'

will be misinterpreted by Python as well, in the same manner. Saving the resulting unicode value to disk as Latin-1, then reading it again as Latin-1 will preserve the error.
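
A minimal sketch of that round trip, assuming the damage happens before the file is ever written. The UTF-8-to-Latin-1 misreading is simulated here with encode/decode rather than with a real mis-declared source file, and the file name spam_demo.txt is hypothetical:

```python
import os
import tempfile

# Escapes keep this demo independent of how the script itself is saved.
original = 'On \u00e9crit \u00e7a dans un fichier.'

# Simulate the suspected mistake: UTF-8 bytes interpreted as Latin-1.
misread = original.encode('utf-8').decode('latin-1')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'spam_demo.txt')  # hypothetical file name
    # Write and read with the same Latin-1 codec, as in the question.
    with open(path, 'w', encoding='ISO-8859-1') as f:
        f.write(misread)
    with open(path, 'r', encoding='ISO-8859-1') as f:
        read_back = f.read()

# The file round trip is lossless, so the damaged text comes back unchanged.
print(read_back == misread)

# Reversing the misreading recovers the intended text.
repaired = read_back.encode('latin-1').decode('utf-8')
print(repaired == original)
```

If this matches what you see, the file I/O is behaving correctly and the string was already wrong when it reached open().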

To debug, take a close look at the output of print(s.encode('unicode_escape')) in the first script. If it looks like:

b'On \\xc3\\xa9crit \\xc3\\xa7a dans un fichier.'

then your actual source code encoding and the PEP 263 header disagree on how the source code should be interpreted. If your source code is decoded correctly, the output is:

b'On \\xe9crit \\xe7a dans un fichier.'
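
The same check can be wrapped in a small helper. This is only a heuristic sketch, and diagnose is a hypothetical name, not something from the standard library: it returns the unicode_escape form together with a flag that is true when the string can be turned back into different text by reversing a UTF-8-read-as-Latin-1 mistake.

```python
def diagnose(s):
    """Return (escaped bytes, looks_misdecoded) for a str value."""
    escaped = s.encode('unicode_escape')
    try:
        # Reverse the suspected mistake: Latin-1 bytes re-read as UTF-8.
        candidate = s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible, so the string was probably decoded correctly.
        return escaped, False
    # Pure ASCII survives unchanged; only a real difference is suspicious.
    return escaped, candidate != s


good = 'On \u00e9crit \u00e7a dans un fichier.'
bad = good.encode('utf-8').decode('latin-1')

print(diagnose(good))
print(diagnose(bad))
```

A true flag is a strong hint rather than proof, since a few legitimate Latin-1 strings also happen to form valid UTF-8 byte sequences.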

If Spyder is stubbornly ignoring the PEP 263 header and reading your source as Latin-1 regardless, avoid using non-ASCII characters and use escape codes instead, either \uxxxx Unicode code points:

s = 'On \u00e9crit \u00e7a dans un fichier.'

or \xaa one-byte escape codes for code-points below 256:

s = 'On \xe9crit \xe7a dans un fichier.'
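
As a quick sanity check (a sketch, with variable names chosen for illustration), both escape styles should produce the same string when é is written as code point E9 and ç as code point E7, and that string should encode to exactly one byte per character in Latin-1:

```python
s_u = 'On \u00e9crit \u00e7a dans un fichier.'  # \uxxxx code point escapes
s_x = 'On \xe9crit \xe7a dans un fichier.'      # \xhh escapes, code points below 256

# Both literals denote the same text.
same = s_u == s_x

# Latin-1 uses one byte per character; UTF-8 needs two for each accented letter.
latin1_bytes = s_u.encode('latin-1')
utf8_bytes = s_u.encode('utf-8')

print(same)
print(len(s_u), len(latin1_bytes), len(utf8_bytes))
```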

Martijn Pieters answered Oct 01 '22 19:10