First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML templates and CSS files. These resources are stored as binary data (BOM and all) in my DB.
When I retrieve the templates from the DB, I decode them using template.decode('utf-8'). When the HTML arrives in the browser, the BOM is present at the beginning of the HTTP response body. This generates a very interesting error in Chrome:
Extra <html> encountered. Migrating attributes back to the original <html> element and ignoring the tag.
Chrome seems to generate an <html> tag automatically when it sees the BOM and mistakes it for content, making the real <html> tag an error.
So, using Python, what is the best way to remove the BOM from my UTF-8 encoded templates (if it exists -- I can't guarantee this in the future)?
For other text-based files like CSS, will major browsers correctly interpret (or ignore) the BOM? They are being sent as plain binary data without .decode('utf-8').
Note: I am using Python 2.5.
Thanks!
Since you state:
All of my (text) files are currently stored in UTF-8 with the BOM
then use the 'utf-8-sig' codec to decode them:
>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'
It automatically removes the expected BOM, and works correctly if the BOM is not present as well.
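For files such as CSS that are sent as raw bytes without decoding, one option is to strip the BOM at the byte level. The following is a minimal sketch, not a definitive implementation: strip_utf8_bom is a hypothetical helper name, and it assumes the data is a UTF-8 byte string that may or may not start with the BOM. It relies only on codecs.BOM_UTF8 from the standard library.

```python
import codecs

def strip_utf8_bom(data):
    """Return the byte string without a leading UTF-8 BOM, if one is present.

    strip_utf8_bom is a hypothetical helper name used for illustration.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Drop exactly the three BOM bytes and keep the rest untouched.
        return data[len(codecs.BOM_UTF8):]
    # No BOM: return the data unchanged.
    return data

# Example input: a small stylesheet stored with a BOM (made-up content).
css_with_bom = codecs.BOM_UTF8 + u'body { margin: 0; }'.encode('utf-8')
css_clean = strip_utf8_bom(css_with_bom)
```

Calling the helper on data that has no BOM returns it unchanged, so it should be safe to apply to every text resource as it is read from the DB.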