
Dealing with UTF-8 numbers in Python

Suppose I am reading a file containing 3 comma-separated numbers. The file was saved with an unknown encoding; so far I am dealing with ANSI and UTF-8. If the file is in UTF-8 and has one row with the values 115,113,12, then:

with open(file) as f:
    a,b,c=map(int,f.readline().split(','))

would throw this:

invalid literal for int() with base 10: '\xef\xbb\xbf115'

The first number is always prefixed with the bytes '\xef\xbb\xbf'. For the other two numbers the conversion works fine. If I manually replace '\xef\xbb\xbf' with '' and then do the int conversion, it works.

Is there a better way of doing this for any type of encoded file?

Ηλίας asked Mar 01 '10 23:03



2 Answers

import codecs

# "utf-8-sig" decodes UTF-8 and skips a leading BOM if one is present.
with codecs.open(file, "r", "utf-8-sig") as f:
    a, b, c = map(int, f.readline().split(","))

This works in Python 2.6.4. The codecs.open call opens the file and returns data as unicode, decoding from UTF-8 and skipping the initial BOM if one is present. A file without a BOM is read unchanged.
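For readers on Python 3, the same codec name can be passed to the built-in open. The following is a minimal sketch, not part of the original answer; the helper name read_three_ints and the temporary demo file are assumptions made for illustration.

```python
import codecs
import os
import tempfile


def read_three_ints(path):
    """Parse three comma-separated ints from the first line of `path`.

    Hypothetical helper, named here only for illustration.
    """
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present;
    # a file without a BOM is read unchanged.
    with open(path, "r", encoding="utf-8-sig") as f:
        a, b, c = map(int, f.readline().split(","))
    return a, b, c


# Demo: write a small file that starts with the UTF-8 BOM, then read it back.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(codecs.BOM_UTF8 + b"115,113,12\n")
    demo_path = tmp.name

print(read_three_ints(demo_path))
os.remove(demo_path)
```

Because the codec tolerates a missing BOM, the same call should also handle plain ASCII digits saved without one.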

tzot answered Sep 20 '22 00:09


What you're seeing is a UTF-8 encoded BOM, or "Byte Order Mark". The BOM is not usually used for UTF-8 files, so the best way to handle it might be to open the file with a UTF-8 codec, and skip over the U+FEFF character if present.
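The approach described above, decoding as plain UTF-8 and skipping U+FEFF by hand, might look like the following in Python 3. This is a sketch under assumptions: the function name parse_first_line and the in-memory streams are hypothetical and stand in for a real file.

```python
import io

BOM = "\ufeff"  # U+FEFF; encoded in UTF-8 this is the bytes EF BB BF


def parse_first_line(stream):
    """Parse three comma-separated ints from the first line of a text stream."""
    line = stream.readline()
    # The plain "utf-8" codec keeps the BOM as a U+FEFF character,
    # so drop it by hand when it is there.
    if line.startswith(BOM):
        line = line[len(BOM):]
    return tuple(map(int, line.split(",")))


# In-memory stand-ins for a file saved with and without a BOM.
with_bom = io.TextIOWrapper(io.BytesIO(b"\xef\xbb\xbf115,113,12\n"), encoding="utf-8")
without_bom = io.TextIOWrapper(io.BytesIO(b"115,113,12\n"), encoding="utf-8")

print(parse_first_line(with_bom))
print(parse_first_line(without_bom))
```

Doing the check manually is more verbose than using a BOM-aware codec, but it makes the skipped character explicit.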

Greg Hewgill answered Sep 20 '22 00:09