I have this file, myfile (pasted here; I hope the data that causes the problem survived the copy/paste). I try to read it with:
import codecs
codecs.open('myfile', 'r', 'utf-8').read()
But this gives:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 7128: invalid continuation byte
If I check the file:
» file myfile
myfile: C source, ISO-8859 text
I often deal with files that I did not generate myself (system files, files downloaded from the internet, files contributed by providers or customers, ...), and those files give no hint about the encoding they use. In a multicultural environment (Europe), it is difficult to know how they were encoded. Most of the time, even the person providing the file has no idea about its encoding, which is chosen behind the scenes by their editor or tool of choice. How can I be sure which encoding is used, on a file-by-file basis?
With Python 3.3 you can use the built-in open() function, which accepts an encoding argument:
open("myfile",encoding="ISO-8859-1")
Change the codec in the open() call. The ISO-8859 standard defines several codecs; I picked Latin-1 (iso-8859-1) here, but you may need a different one:
codecs.open('myfile', 'r', 'iso-8859-1').read()
See the codecs module documentation for a list of valid codecs. Judging by the pastie data, iso-8859-1 looks like the correct codec to use, as it covers Scandinavian text.
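To illustrate why the choice among the ISO-8859 codecs matters, here is a small sketch that decodes the byte from the error message (0xe5) with a few of them. The sample bytes are made up for illustration and are not taken from the actual file.

```python
# The byte 0xe5 from the traceback, decoded with several ISO-8859 parts.
# Each part maps the same byte to a different character.
raw = b"\xe5"

decoded = {}
for codec in ("iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"):
    decoded[codec] = raw.decode(codec)
    print(codec, repr(decoded[codec]))

# Hypothetical Latin-1 encoded sample: 0xe5 followed by an ASCII letter
# is not a valid UTF-8 sequence, so decoding as UTF-8 fails.
utf8_error = None
try:
    b"bl\xe5b\xe6r".decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = str(exc)
    print(utf8_error)
```

All of these single-byte codecs decode the byte without complaint, so a successful decode alone does not tell you which one the author intended.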
Generally, without other sources, you cannot know what codec a file uses. At best, you can guess (which is what file does).
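If guessing is acceptable, one simple approach is to try a short list of candidate codecs in order and keep the first one that decodes cleanly. The following is only a sketch: guess_decode and its default candidate list are hypothetical names chosen for this example, and the sample bytes are made up.

```python
def guess_decode(data, candidates=("utf-8", "iso-8859-1")):
    """Return (text, codec) for the first candidate codec that decodes data.

    A successful decode only shows that the bytes are valid in that codec,
    not that it is the codec the author actually used.
    """
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate codecs could decode the data")


# Hypothetical sample: Scandinavian text encoded as Latin-1.
sample = b"bl\xe5b\xe6r"
text, codec = guess_decode(sample)
print(codec, text)
```

Strict codecs such as utf-8 should come first in the list. iso-8859-1 assigns a character to every possible byte value, so it never raises an error and works only as a last-resort fallback.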