Suppose I am reading a file containing 3 comma-separated numbers. The file was saved with an unknown encoding; so far I am dealing with ANSI and UTF-8. If the file was in UTF-8 and it had 1 row with values 115,113,12 then:
with open(file) as f:
    a, b, c = map(int, f.readline().split(','))
would throw this:
invalid literal for int() with base 10: '\xef\xbb\xbf115'
The first number is always mangled with these '\xef\xbb\xbf' characters. The conversion works fine for the other 2 numbers. If I manually replace '\xef\xbb\xbf' with '' and then do the int conversion, it works.
Is there a better way of doing this for any type of encoded file?
UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
decode() is a method on strings in Python 2. It is the opposite of encode(): it takes the name of the encoding that the byte string is in, decodes it, and returns the corresponding unicode string.
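As a sketch (shown in Python 3 syntax, where decode() lives on bytes objects), decoding the questioner's mangled first field with the utf-8-sig codec strips the BOM, while plain utf-8 keeps it as U+FEFF:

```python
# The raw first field from the question, including the UTF-8 BOM bytes.
raw = b'\xef\xbb\xbf115'

# Plain utf-8 keeps the BOM as the character U+FEFF...
print(repr(raw.decode('utf-8')))       # '\ufeff115'

# ...while utf-8-sig strips it, so int() succeeds.
print(int(raw.decode('utf-8-sig')))    # 115
```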
UTF-8 treats bytes 0-127 as plain ASCII, bytes 192-247 as lead bytes that act like shift keys, and bytes 128-191 as the continuation bytes being shifted. For instance, lead bytes 208 and 209 shift you into the Cyrillic range: 208 followed by 175 is character 1071, the Cyrillic Я.
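You can verify that example directly by decoding the two-byte sequence:

```python
# Decode the two-byte UTF-8 sequence 208, 175 (0xD0 0xAF).
ch = bytes([208, 175]).decode('utf-8')
print(ch, ord(ch))  # Я 1071
```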
import codecs
with codecs.open(file, "r", "utf-8-sig") as f:
    a, b, c = map(int, f.readline().split(","))
This works in Python 2.6.4. The codecs.open call opens the file and returns the data as unicode, decoded from UTF-8, with the initial BOM skipped.
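As a side note beyond the original Python 2 context: in Python 3 the built-in open accepts an encoding argument, so codecs is not needed. A minimal sketch (the file name and contents are made up for illustration):

```python
import os
import tempfile

# Write a sample file with a BOM, mimicking the questioner's input.
path = os.path.join(tempfile.gettempdir(), 'nums.csv')  # hypothetical file
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write('115,113,12\n')

# utf-8-sig transparently skips the BOM if present, and works
# unchanged on files that were saved without one.
with open(path, 'r', encoding='utf-8-sig') as f:
    a, b, c = map(int, f.readline().split(','))
print(a, b, c)  # 115 113 12
```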
What you're seeing is a UTF-8 encoded BOM, or "Byte Order Mark". The BOM is not usually used for UTF-8 files, so the best way to handle it might be to open the file with a UTF-8 codec, and skip over the U+FEFF
character if present.