Suppose I am reading a file containing 3 comma-separated numbers. The file was saved with an unknown encoding; so far I am dealing with ANSI and UTF-8. If the file was in UTF-8 and it had 1 row with values 115,113,12 then:
with open(file) as f:
    a, b, c = map(int, f.readline().split(','))
would throw this:
invalid literal for int() with base 10: '\xef\xbb\xbf115'
The first number is always mangled with these '\xef\xbb\xbf' characters. The conversion works fine for the other 2 numbers. If I manually replace '\xef\xbb\xbf' with '' and then do the int conversion, it works.
Is there a better way of doing this for any type of encoded file?
UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
decode() is a method on strings in Python 2. It is the opposite of encode(): it takes the name of the encoding that the byte string is in, decodes it, and returns the corresponding unicode string.
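As a sketch (shown in Python 3 syntax, where decode() lives on bytes objects), decoding the questioner's mangled first field with the utf-8-sig codec strips the BOM, while plain utf-8 keeps it as U+FEFF:

```python
# The raw first field from the question, including the UTF-8 BOM bytes.
raw = b'\xef\xbb\xbf115'

# Plain utf-8 keeps the BOM as the character U+FEFF...
print(repr(raw.decode('utf-8')))       # '\ufeff115'

# ...while utf-8-sig strips it, so int() succeeds.
print(int(raw.decode('utf-8-sig')))    # 115
```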
UTF-8 treats bytes 0-127 as plain ASCII, bytes 192-247 as lead bytes that act like shift keys, and bytes 128-191 as the continuation bytes being shifted. For instance, lead bytes 208 and 209 shift you into the Cyrillic range: 208 followed by 175 is character 1071, the Cyrillic Я.
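You can verify that example directly by decoding the two-byte sequence:

```python
# Decode the two-byte UTF-8 sequence 208, 175 (0xD0 0xAF).
ch = bytes([208, 175]).decode('utf-8')
print(ch, ord(ch))  # Я 1071
```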
import codecs
with codecs.open(file, "r", "utf-8-sig") as f:
    a, b, c = map(int, f.readline().split(","))
This works in Python 2.6.4. The codecs.open call opens the file and returns the data as unicode, decoded from UTF-8, with the initial BOM skipped.
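As a side note beyond the original Python 2 context: in Python 3 the built-in open accepts an encoding argument, so codecs is not needed. A minimal sketch (the file name and contents are made up for illustration):

```python
import os
import tempfile

# Write a sample file with a BOM, mimicking the questioner's input.
path = os.path.join(tempfile.gettempdir(), 'nums.csv')  # hypothetical file
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write('115,113,12\n')

# utf-8-sig transparently skips the BOM if present, and works
# unchanged on files that were saved without one.
with open(path, 'r', encoding='utf-8-sig') as f:
    a, b, c = map(int, f.readline().split(','))
print(a, b, c)  # 115 113 12
```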
What you're seeing is a UTF-8 encoded BOM, or "Byte Order Mark". The BOM is not usually used for UTF-8 files, so the best way to handle it might be to open the file with a UTF-8 codec, and skip over the U+FEFF
character if present.