
How to read a "C source, ISO-8859 text"

Tags: python, unicode

I have a file, myfile (I have pasted it, and I hope the relevant data with the problems has survived the copy/paste). I try to read that file with:

import codecs
codecs.open('myfile', 'r', 'utf-8').read()

But this gives:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 7128: invalid continuation byte

If I check the file:

» file myfile
myfile: C source, ISO-8859 text

  • How can I read that kind of file (ISO-8859) in Python?
  • In the general case, how can I find out how a file is encoded?

I often deal with files that I did not generate myself (system files, random files downloaded from the internet, files contributed by providers or customers, ...), and those files give no clue about the encoding they use. In a multi-cultural environment (Europe), it is difficult to know how such files have been encoded. Most of the time, even the person providing the files has no idea about the encoding, since the editor or tool of choice handles it behind the scenes. How can I be sure about the encoding being used, on a file-by-file basis?

blueFast asked Jun 02 '13 13:06



2 Answers

With Python 3.3 you can use the built-in open function, which accepts an encoding argument:

with open("myfile", encoding="ISO-8859-1") as f:
    text = f.read()
David Michael Gang answered Oct 20 '22 13:10



Change the codec in the codecs.open() call. The ISO-8859 standard defines several encodings; I picked Latin-1 (ISO-8859-1) for you here, but you may need to pick another one:

codecs.open('myfile', 'r', 'iso-8859-1').read()

See the codecs module documentation for a list of valid codecs. Judging by the pastie data, iso-8859-1 is the correct codec to use, as it covers the characters used in Scandinavian text.
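To illustrate why the UTF-8 read fails on a byte such as 0xe5 while Latin-1 succeeds, here is a small sketch. The sample bytes are made up for illustration and are not taken from the original file.

```python
# Hypothetical sample: the word "på" encoded as ISO-8859-1, followed by a space.
raw = b"p\xe5 "

try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xe5 announces a three-byte sequence, so the plain
    # space that follows is rejected as a continuation byte.
    reason = exc.reason

# ISO-8859-1 maps every single byte to one character, so this cannot fail.
decoded = raw.decode("iso-8859-1")

print(reason)
print(repr(decoded))
```

Because ISO-8859-1 accepts every byte value, a successful decode only shows that the bytes are readable as Latin-1, not that Latin-1 is what the author intended.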

Generally, without other sources, you cannot know what codec a file uses. At best, you can guess (which is what file does).
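One way to do such guessing yourself is to try a short list of candidate codecs in order and keep the first one that decodes without error. This is only a minimal sketch: the function name guess_encoding and the candidate list are assumptions chosen for illustration, and the result is still a guess, not proof.

```python
def guess_encoding(data, candidates=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes data.

    Hypothetical helper: order matters. Strict codecs such as utf-8 go
    first; iso-8859-1 accepts any byte sequence, so it must come last.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


# Made-up sample bytes, one per encoding.
latin1_bytes = b"p\xe5 "                   # "på " in ISO-8859-1
utf8_bytes = "p\u00e5 ".encode("utf-8")    # the same text in UTF-8

print(guess_encoding(latin1_bytes)[1])
print(guess_encoding(utf8_bytes)[1])
```

For a real file, read the raw bytes first, for example with open('myfile', 'rb'), and pass them to the function. If you know which languages to expect, adjust the candidate list accordingly.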

Martijn Pieters answered Oct 20 '22 11:10
