In a code review I came across the following code:
# Python bug that renders the unicode identifier (0xEF 0xBB 0xBF)
# as a character.
# If untreated, it can prevent the page from validating or rendering
# properly.
bom = unicode( codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')
This is in a function that passes a string to a Response object (Django or Flask).
Is this still a bug that needs this fix in Python 2.7 or 3? Something tells me it isn't, but I thought I'd ask because I don't know this problem very well.
I'm not sure where this came from, but I've seen it around the Internet, referenced sometimes in association with Jinja2 (which we are using).
Thanks for reading.
The Unicode standard states that the character \ufeff has two distinct meanings: at the start of a data stream it serves as a byte-order and/or encoding signature, but anywhere else it should be interpreted as a zero-width non-breaking space.
So the code
bom = unicode(codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')
isn't just removing the utf-8 encoding signature (aka BOM) - it's also removing any embedded zero-width non-breaking spaces.
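A quick sketch of what that replace actually does (Python 3 syntax here, since str is already Unicode; the original snippet uses Python 2's unicode()):
import codecs
# The decoded BOM is the single character U+FEFF.
bom = codecs.BOM_UTF8.decode("utf8")          # '\ufeff'
# A string with a leading BOM *and* an embedded zero-width non-breaking space.
r = "\ufeffHello,\ufeff world"
print(repr(r.replace(bom, "")))               # 'Hello, world' - both are removed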
Some earlier versions of Python did not have a variant of the "utf-8" codec that skips the BOM when reading data streams. Since this was inconsistent with the other Unicode codecs, a "utf-8-sig" codec, which does skip the BOM, was introduced in Python 2.5.
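A minimal illustration of the difference between the two codecs (Python 3 syntax; the bytes literal is just made-up sample data):
data = b"\xef\xbb\xbfhello"              # UTF-8 bytes with a leading BOM
print(repr(data.decode("utf-8")))        # '\ufeffhello' - plain codec keeps the BOM
print(repr(data.decode("utf-8-sig")))    # 'hello'       - sig codec strips it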
So it's possible the "Python bug" mentioned in the code comments relates to that.
However, it seems more likely that the "bug" relates to embedded \ufeff characters. Since the Unicode standard clearly states those can be interpreted as legitimate characters, it is really up to the data consumer to decide how to treat them - and therefore this is not a bug in Python.
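If the consumer only wants to drop the signature and leave any intentional embedded characters alone, one possible approach (strip_leading_bom is just an illustrative name, not part of the original code) would be something like:
def strip_leading_bom(text):
    # Remove a single signature BOM at the very start only; any embedded
    # U+FEFF stays, in case it is a deliberate zero-width non-breaking space.
    if text.startswith("\ufeff"):
        return text[1:]
    return text
print(repr(strip_leading_bom("\ufeffabc\ufeffdef")))   # 'abc\ufeffdef'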