For some reason, Python seems to have trouble with the BOM when reading unicode strings from a UTF-8 file. Consider the following:
with open('test.py') as f:
    for line in f:
        print unicode(line, 'utf-8')
Seems straightforward, doesn't it?
That's what I thought until I ran it from the command line and got:
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
A quick Google search revealed that the BOM has to be stripped manually:
import codecs

with open('test.py') as f:
    for line in f:
        print unicode(line.replace(codecs.BOM_UTF8, ''), 'utf-8')
This one runs fine. However, I'm struggling to see any merit in this.
Is there a rationale behind the behavior described above? By contrast, UTF-16 works seamlessly.
UTF-8 is a byte-oriented encoding: each character is represented by a specific sequence of one or more bytes.
There is no official difference between UTF-8 and BOM-ed UTF-8. A BOM-ed UTF-8 stream simply starts with the three bytes EF BB BF, and those bytes, if present, must be ignored when extracting the string from the file/stream.
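As a minimal sketch, here is one way to strip that signature by hand before decoding (reusing the test.py name from the question and assuming the file is small enough to read at once):

import codecs

with open('test.py', 'rb') as f:
    data = f.read()

# Drop the optional UTF-8 signature (EF BB BF) before decoding
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]

text = data.decode('utf-8')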
In Python 2, a plain str is just a sequence of 8-bit bytes, so reading the file gives you the raw UTF-8 bytes (including the BOM) and you have to decode them yourself. A unicode string, by contrast, is a sequence of Unicode code points, which allows for a much wider set of characters, including special characters from most languages in the world.
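A small illustration of the difference, using hypothetical byte values and Python 2 syntax as in the question:

raw = '\xef\xbb\xbfhello'        # str: raw UTF-8 bytes, starting with a BOM
text = raw.decode('utf-8')       # unicode: BOM survives as u'\ufeff'
clean = raw.decode('utf-8-sig')  # unicode: BOM consumed by the codec
print repr(text), repr(clean)    # u'\ufeffhello' u'hello'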
The 'utf-8-sig' encoding will consume the BOM signature on your behalf.
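For example, a sketch that reopens the file from the question with that codec (using codecs.open, which takes an encoding argument in Python 2):

import codecs

with codecs.open('test.py', encoding='utf-8-sig') as f:
    for line in f:
        # line is already unicode and the BOM has been consumed,
        # so no manual replace() is needed
        print line.rstrip()

Since u'\ufeff' never reaches print, the original charmap error disappears, provided the rest of the file's characters are representable in the console's encoding.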