
Is unicode( codecs.BOM_UTF8, "utf8" ) necessary in Python 2.7/3?

In a code review I came across the following code:

# Python bug that renders the unicode identifier (0xEF 0xBB 0xBF)
# as a character.
# If untreated, it can prevent the page from validating or rendering 
# properly. 
bom = unicode( codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')

This is in a function that passes a string to a Response object (Django or Flask).

Is this still a bug that needs this fix in Python 2.7 or 3? Something tells me it isn't, but I thought I'd ask because I don't know this problem very well.

I'm not sure where this came from, but I've seen it around the Internet, referenced sometimes in association with Jinja2 (which we are using).

Thanks for reading.

Brian M. Hunt asked Nov 11 '11 15:11



1 Answer

The Unicode standard states that the character \ufeff has two distinct meanings. At the start of a data stream, it should be used as a byte-order and/or encoding signature, but elsewhere it should be interpreted as a zero-width non-breaking space.
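As a quick illustration (a Python 3 sketch, not part of the code under review), the three UTF-8 signature bytes decode to that single character:

```python
import codecs
import unicodedata

# The UTF-8 signature is the three bytes 0xEF 0xBB 0xBF.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Decoded with the plain "utf-8" codec, those bytes become U+FEFF.
bom = codecs.BOM_UTF8.decode("utf-8")
print(repr(bom))              # '\ufeff'
print(unicodedata.name(bom))  # ZERO WIDTH NO-BREAK SPACE
```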

So the code

bom = unicode(codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')

isn't just removing the UTF-8 encoding signature (aka BOM); it's also removing any embedded zero-width non-breaking spaces.
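A minimal Python 3 sketch of the difference, assuming a made-up sample string. The helper strip_leading_bom is hypothetical, shown only to contrast with the blanket replace; it is not something the reviewed code or any library provides:

```python
BOM = "\ufeff"


def strip_leading_bom(text):
    """Remove a single U+FEFF only if it is the first character (hypothetical helper)."""
    return text[1:] if text.startswith(BOM) else text


# Made-up sample: one signature at the start, one embedded U+FEFF.
sample = BOM + "foo" + BOM + "bar"

everywhere = sample.replace(BOM, "")       # what the reviewed code does
leading_only = strip_leading_bom(sample)   # touches only the signature position

print(repr(everywhere))    # 'foobar'
print(repr(leading_only))  # 'foo\ufeffbar'
```

Which behaviour is wanted depends on whether embedded U+FEFF characters in the response body are considered meaningful.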

Some earlier versions of Python did not have a variant of the "utf-8" codec that skips the BOM when reading data streams. Since this was inconsistent with the other Unicode codecs, a "utf-8-sig" codec was introduced in version 2.5, which does skip the BOM when decoding.

So it's possible the "Python bug" mentioned in the code comments relates to that.
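The difference between the two codecs can be seen in a short Python 3 sketch (the byte string here is made up for illustration):

```python
import codecs

# Made-up input: UTF-8 bytes with a leading signature.
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

plain = raw.decode("utf-8")       # signature is kept as U+FEFF
signed = raw.decode("utf-8-sig")  # signature is skipped

print(repr(plain))   # '\ufeffhello'
print(repr(signed))  # 'hello'

# Input without a signature decodes the same way under "utf-8-sig".
no_bom = b"hello".decode("utf-8-sig")
print(repr(no_bom))  # 'hello'
```

If the stray character comes from files or templates saved with a signature, decoding them with "utf-8-sig" at read time may avoid the need to strip anything afterwards.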

However, it seems more likely that the "bug" relates to embedded \ufeff characters. But since the Unicode Standard clearly states that they can be interpreted as legitimate characters, it is really up to the data consumer to decide how to treat them, so this is not a bug in Python.

ekhumoro answered Sep 21 '22 12:09
