Why do Python unicode strings require special treatment for UTF-8 BOM?

For some reason, Python seems to have trouble with the BOM when reading unicode strings from a UTF-8 file. Consider the following:

with open('test.py') as f:
   for line in f:
      print unicode(line, 'utf-8')

Seems straightforward, doesn't it?

That's what I thought until I ran it from the command line and got:

UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

A quick Google search revealed that the BOM has to be stripped manually:

import codecs
with open('test.py') as f:
   for line in f:
      print unicode(line.replace(codecs.BOM_UTF8, ''), 'utf-8')

This one runs fine. However, I'm struggling to see any merit in this.

Is there a rationale behind the behavior described above? In contrast, UTF-16 works seamlessly.

Saul asked Sep 01 '11 18:09


People also ask

What does encoding='utf-8' do in Python?

UTF-8 is a byte-oriented encoding. The encoding specifies that each character is represented by a specific sequence of one to four bytes.

What is the difference between UTF-8 and UTF-8 with BOM?

There is no official difference between UTF-8 and BOM-ed UTF-8. A BOM-ed UTF-8 string starts with the three bytes EF BB BF. Those bytes, if present, must be ignored when extracting the string from the file/stream.
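As a rough illustration of the point above, the following sketch decodes the same BOM-prefixed bytes with both codecs. The sample payload `abc` and the variable names are made up for the example; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

# Bytes as they would come out of a BOM-prefixed UTF-8 file (sample payload).
raw = codecs.BOM_UTF8 + b'abc'

# The plain 'utf-8' codec keeps the BOM as the character U+FEFF.
plain = raw.decode('utf-8')

# The 'utf-8-sig' codec drops a leading BOM while decoding.
stripped = raw.decode('utf-8-sig')

# 'utf-8-sig' also accepts input that has no BOM at all.
no_bom = b'abc'.decode('utf-8-sig')

print(repr(plain))
print(repr(stripped))
```

Here `plain` is one character longer than `stripped`, and that extra U+FEFF is the character a console codec may later refuse to encode.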

Are Python strings UTF-8?

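Not exactly. In Python 3, a str is a sequence of Unicode code points rather than UTF-8 bytes; UTF-8 is the default source-file encoding and the default for str.encode() and bytes.decode(). In Python 2, str is a byte string and unicode is the separate text type.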

What is Unicode string in Python?

In Python 2, normal strings (str) are sequences of 8-bit bytes, while Unicode strings (unicode) are sequences of Unicode code points, stored internally with 16 or 32 bits per code unit depending on the build. This allows for a more varied set of characters, including special characters from most languages in the world.


1 Answer

The 'utf-8-sig' encoding will consume the BOM signature on your behalf.
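A minimal sketch of how that might look when reading a file, assuming `io.open` (available on Python 2.6+ and Python 3) is acceptable in place of the built-in `open`. The helper name `read_lines` and the temporary demo file are hypothetical and exist only to make the example self-contained.

```python
import codecs
import io
import os
import tempfile

def read_lines(path):
    """Return the lines of a UTF-8 file as text, dropping a leading BOM if present."""
    # The 'utf-8-sig' codec consumes the BOM, so no manual replace() is needed.
    with io.open(path, encoding='utf-8-sig') as f:
        return [line.rstrip(u'\n') for line in f]

# Demo: write a small BOM-prefixed file (made-up content), then read it back.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + u'first line\nsecond line\n'.encode('utf-8'))

lines = read_lines(demo_path)
os.remove(demo_path)
print(lines)
```

As for why UTF-16 appears to work seamlessly: Python's 'utf-16' codec needs the BOM to determine byte order and consumes it while decoding, whereas the plain 'utf-8' codec has no byte order to detect and passes the BOM through as U+FEFF.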

Josh Lee answered Oct 22 '22 01:10