I’m working on improving the character encoding support for a Python IRC bot that retrieves the titles of pages whose URLs are mentioned in a channel.
The current process I’m using is as follows:
Requests:
r = requests.get(url, headers={ 'User-Agent': '...' })
Beautiful Soup:
soup = bs4.BeautifulSoup(r.text, from_encoding=r.encoding)
title = soup.title.string.replace('\n', ' ').replace(...)
etc.

Specifying from_encoding=r.encoding is a good start, because it allows us to heed the charset from the Content-Type header when parsing the page.

Where this falls on its face is with pages that specify a <meta http-equiv … charset=…"> or <meta charset="…"> instead of (or on top of) a charset in their Content-Type header.
The approaches I currently see from here are as follows: scan the page for a <meta> tag, try to heed any encodings we find there, then fall back on Requests’ .encoding, possibly in combination with the previous option. I find this option ideal, but I’d rather not write this code if it already exists; a rough sketch of what I mean is below.

TL;DR: is there a Right Way™ to make Beautiful Soup correctly heed the character encoding of arbitrary HTML pages on the web, using a similar technique to what browsers use?
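For reference, the hand-rolled version I was hoping to avoid would look something like this. It is a rough, hypothetical illustration only; the regex and the 2048-byte window are my own simplifications, not anything Requests or Beautiful Soup provide:

import re

import bs4
import requests

# Naive scan for a charset declared in the markup, covering both
# <meta charset="..."> and the http-equiv/content= variant.
META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w.:-]+)', re.IGNORECASE)

r = requests.get(url, headers={'User-Agent': '...'})

# Prefer a charset declared in the document itself; fall back on the
# encoding Requests derived from the Content-Type header.
match = META_CHARSET_RE.search(r.content[:2048])
encoding = match.group(1).decode('ascii', 'ignore') if match else r.encoding

soup = bs4.BeautifulSoup(r.content, from_encoding=encoding)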
It seems you want to prefer encodings declared in documents over those declared in the HTTP headers. UnicodeDammit (used internally by BeautifulSoup) does this the other way around if you just pass it the encoding from the header. You can overcome this by reading declared encodings from the document and passing those to try first. Roughly (untested!):
import requests
import bs4
from bs4 import UnicodeDammit

r = requests.get(url, headers={ 'User-Agent': '...' })

content_type_header = r.headers.get('Content-Type', '')
is_html = content_type_header.split(';', 1)[0].lower().startswith('text/html')
declared_encoding = UnicodeDammit.find_declared_encoding(r.text, is_html=is_html)

encodings_to_try = [r.encoding]
if declared_encoding is not None:
    encodings_to_try.insert(0, declared_encoding)

soup = bs4.BeautifulSoup(r.text, from_encoding=encodings_to_try)
title = soup.title...
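In case it helps: in newer Beautiful Soup releases (4.4+, if I recall correctly) the declaration-sniffing logic is also exposed as bs4.dammit.EncodingDetector, which works on raw bytes. Assuming such a version, the same idea might look like this (again a sketch, not tested against your bot):

import requests
import bs4
from bs4.dammit import EncodingDetector

r = requests.get(url, headers={'User-Agent': '...'})

content_type = r.headers.get('Content-Type', '')
is_html = content_type.split(';', 1)[0].lower().startswith('text/html')

# Only treat the header encoding as declared if a charset parameter was
# actually present; otherwise r.encoding is just Requests' ISO-8859-1 default.
http_encoding = r.encoding if 'charset' in content_type.lower() else None

# Sniff <meta>/XML declarations from the raw bytes, preferring them over the header.
html_encoding = EncodingDetector.find_declared_encoding(r.content, is_html=is_html)

soup = bs4.BeautifulSoup(r.content, from_encoding=html_encoding or http_encoding)
title = soup.title.string if soup.title else None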
Unlike the more general module ftfy, the approach that Unicode, Dammit takes is exactly what I’m looking for (see bs4/dammit.py). It heeds the information provided by any <meta> tags, rather than applying more blind guesswork to the problem.
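To see what I mean, Unicode, Dammit can be fed raw bytes directly. In a quick test like the following (a contrived example I made up, with the encoding declared only in the markup), it reports the encoding it picked up from the <meta> tag:

from bs4 import UnicodeDammit

# Bytes in windows-1251, with the encoding declared only in a <meta> tag.
markup = ('<html><head><meta charset="windows-1251"><title>Привет</title>'
          '</head><body></body></html>').encode('windows-1251')

dammit = UnicodeDammit(markup)
print(dammit.original_encoding)      # expected: 'windows-1251', from the <meta> tag
print(type(dammit.unicode_markup))   # <class 'str'>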
When r.text is used, however, Requests tries to be helpful by automatically decoding pages with the charset from their Content-Type header, falling back to ISO 8859-1 where it’s not present, but Unicode, Dammit does not touch any markup that is already in a unicode string!
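As a concrete illustration of the difference (the values shown are hypothetical; what you actually get depends on the server):

import requests

r = requests.get(url, headers={'User-Agent': '...'})

# If the header is just 'text/html' with no charset parameter, Requests
# falls back to ISO-8859-1 when building r.text:
print(r.headers.get('Content-Type'))   # e.g. 'text/html'
print(r.encoding)                      # e.g. 'ISO-8859-1' (the fallback)

# r.text is already a decoded unicode string; r.content is the raw bytes
# that Unicode, Dammit can still inspect:
print(type(r.text), type(r.content))   # <class 'str'> <class 'bytes'>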
The solution I chose was to use r.content instead:
r = requests.get(url, headers={ 'User-Agent': '...' })
soup = bs4.BeautifulSoup(r.content)
title = soup.title.string.replace('\n', ' ').replace(...)
etc.

The only drawback that I can see is that pages with only a charset in their Content-Type header will be subject to some guesswork by Unicode, Dammit, because passing BeautifulSoup the from_encoding=r.encoding argument will override Unicode, Dammit completely.
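For completeness, the two behaviours side by side; as far as I can tell, from_encoding is handed to Unicode, Dammit as an override, so it effectively wins whenever the bytes happen to decode under it:

# Header charset effectively wins; <meta> declarations are not consulted
# as long as the bytes decode under r.encoding:
soup = bs4.BeautifulSoup(r.content, from_encoding=r.encoding)

# No override: Unicode, Dammit heeds <meta> declarations, and falls back on
# guessing only when the page declares nothing usable:
soup = bs4.BeautifulSoup(r.content)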