
Python: How do I force iso-8859-1 file output?

How do I force Latin-1 (which I guess means iso-8859-1?) file output in Python?

Here's my code at the moment. It works, but trying to import the resulting output file into a Latin-1 MySQL table produces weird encoding errors.

outputFile = file( "textbase.tab", "w" )
for k, v in textData.iteritems():
    complete_line = k + '~~~~~' + v + '~~~~~' + " ENDOFTHELINE"
    outputFile.write(complete_line)
    outputFile.write( "\n" )
outputFile.close()

The resulting output file seems to be saved in "Western (Mac OS Roman)", but if I then save it in Latin-1, I still get strange encoding problems. How can I make sure that the strings used, and the file itself, are all encoded in Latin-1 as soon as they are generated?

The original strings (in the textData dictionary) have been parsed in from an RTF file - I don't know if that makes a difference.

I'm a bit new to Python and to encoding generally, so apologies if this is a dumb question. I have tried looking at the docs but haven't got very far.

I'm using Python 2.6.1.

AP257 asked Feb 03 '10 12:02



2 Answers

Simply use the codecs module for writing the file:

import codecs
outputFile = codecs.open("textbase.tab", "w", "ISO-8859-1")

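Of course, the strings you write have to be Unicode strings (type unicode). Plain str objects, which are basically just arrays of bytes, are not converted for you: in Python 2, writing one that contains non-ASCII bytes typically fails with a UnicodeDecodeError, because Python first tries to decode it as ASCII. I guess you are reading the RTF file with the normal Python file object as well, so you might have to switch that to codecs.open too, or decode the strings explicitly once you know which encoding they are in.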
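To make that concrete, here is a minimal sketch of the whole write loop. The text_data dictionary, the sample strings, the temporary output path, and the cp1252 source encoding are all assumptions for illustration; use whatever encoding your RTF parser actually hands you. It is written so that it should run on Python 2.6+ as well as Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the textData dictionary from the question.
# Keys and values are text (unicode) strings, not byte strings.
text_data = {u"caf\u00e9": u"na\u00efve"}

# If a value arrives as raw bytes, decode it first using the encoding it
# was really stored in (cp1252 is only an assumption here).
raw_value = b"d\xe9j\xe0 vu"
text_data[u"seen"] = raw_value.decode("cp1252")

# Write to a temporary directory so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "textbase.tab")

# codecs.open encodes every unicode string to ISO-8859-1 on write.
output_file = codecs.open(out_path, "w", "ISO-8859-1")
try:
    for key, value in sorted(text_data.items()):
        output_file.write(key + u"~~~~~" + value + u"~~~~~" + u" ENDOFTHELINE")
        output_file.write(u"\n")
finally:
    output_file.close()

# Read the file back as raw bytes to check what actually landed on disk.
with open(out_path, "rb") as check:
    written = check.read()
```

In the bytes read back, each accented character should occupy exactly one byte (for example 0xE9 for é), which is what a Latin-1 MySQL table expects.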

Torsten Marek answered Oct 11 '22 17:10


For me, io.open is a bit faster than codecs.open on Python 2.7 for writes, and an order of magnitude faster for reads:

import io
with io.open("textbase.tab", "w", encoding="ISO-8859-1") as outputFile:
    ...

In Python 3, you can just pass the encoding keyword argument to the built-in open.
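A rough Python 3 version of the question's loop might look like this. The text_data contents and the temporary path are made up for the example, and newline="\n" is only there so the output bytes are the same on every platform.

```python
import os
import tempfile

# Hypothetical sample data; in Python 3 every str is already Unicode text.
text_data = {"caf\u00e9": "na\u00efve"}

# Write to a temporary directory so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "textbase.tab")

# The built-in open encodes each string to ISO-8859-1 as it is written.
with open(out_path, "w", encoding="iso-8859-1", newline="\n") as output_file:
    for key, value in text_data.items():
        output_file.write(key + "~~~~~" + value + "~~~~~" + " ENDOFTHELINE\n")

# Read the raw bytes back to check the on-disk encoding.
with open(out_path, "rb") as check:
    written = check.read()

# Characters with no Latin-1 equivalent (the euro sign, for instance)
# cannot be encoded and raise UnicodeEncodeError.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False
```

If your data may contain characters outside Latin-1, open also accepts an errors argument (for example errors="replace") to control what happens instead of raising.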

beardc answered Oct 11 '22 17:10