In a code review I came across the following code:
# Python bug that renders the unicode identifier (0xEF 0xBB 0xBF)
# as a character.
# If untreated, it can prevent the page from validating or rendering
# properly.
bom = unicode( codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')
This is in a function that passes a string to a Response object (Django or Flask).
Is this still a bug that needs this fix in Python 2.7 or 3? Something tells me it isn't, but I thought I'd ask because I don't know this problem very well.
I'm not sure where this came from, but I've seen it around the Internet, referenced sometimes in association with Jinja2 (which we are using).
Thanks for reading.
The Unicode standard states that the character \ufeff has two distinct meanings: at the start of a data stream it serves as a byte-order and/or encoding signature, but anywhere else it should be interpreted as a zero-width non-breaking space.
So the code
bom = unicode(codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')
isn't just removing the utf-8 encoding signature (aka BOM) - it's also removing any embedded zero-width non-breaking spaces.
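A quick sketch of what that replace actually does (Python 3 syntax here, since str is already Unicode; the original snippet uses Python 2's unicode()):
import codecs
# The decoded BOM is the single character U+FEFF.
bom = codecs.BOM_UTF8.decode("utf8")          # '\ufeff'
# A string with a leading BOM *and* an embedded zero-width non-breaking space.
r = "\ufeffHello,\ufeff world"
print(repr(r.replace(bom, "")))               # 'Hello, world' - both are removed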
Some earlier versions of Python did not have a variant of the "utf-8" codec that skips the BOM when reading data streams. Since this was inconsistent with the other Unicode codecs, a "utf-8-sig" codec, which does skip the BOM, was introduced in Python 2.5.
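A minimal illustration of the difference between the two codecs (Python 3 syntax; the bytes literal is just made-up sample data):
data = b"\xef\xbb\xbfhello"              # UTF-8 bytes with a leading BOM
print(repr(data.decode("utf-8")))        # '\ufeffhello' - plain codec keeps the BOM
print(repr(data.decode("utf-8-sig")))    # 'hello'       - sig codec strips it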
So it's possible the "Python bug" mentioned in the code comments relates to that.
However, it seems more likely that the "bug" relates to embedded \ufeff characters. Since the Unicode standard clearly states those can be interpreted as legitimate characters, it is really up to the data consumer to decide how to treat them - and therefore this is not a bug in Python.
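If the consumer only wants to drop the signature and leave any intentional embedded characters alone, one possible approach (strip_leading_bom is just an illustrative name, not part of the original code) would be something like:
def strip_leading_bom(text):
    # Remove a single signature BOM at the very start only; any embedded
    # U+FEFF stays, in case it is a deliberate zero-width non-breaking space.
    if text.startswith("\ufeff"):
        return text[1:]
    return text
print(repr(strip_leading_bom("\ufeffabc\ufeffdef")))   # 'abc\ufeffdef'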