 

Correctly detect encoding without any guessing when using Beautiful Soup

I’m working on improving the character encoding support for a Python IRC bot that retrieves the titles of pages whose URLs are mentioned in a channel.

The current process I’m using is as follows:

  1. Requests:

    r = requests.get(url, headers={ 'User-Agent': '...' })
    
  2. Beautiful Soup:

    soup = bs4.BeautifulSoup(r.text, from_encoding=r.encoding)
    
  3. title = soup.title.string.replace('\n', ' ').replace(...) etc.

Specifying from_encoding=r.encoding is a good start, because it allows us to heed the charset from the Content-Type header when parsing the page.

Where this falls on its face is with pages that specify a <meta http-equiv="Content-Type" content="…; charset=…"> or <meta charset="…"> instead of (or on top of) a charset in their Content-Type header.
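To make the failure concrete, here is a minimal sketch (using a made-up page body rather than a real URL): by the time Beautiful Soup sees the markup, it has already been decoded with the charset implied by the header, so the <meta> declaration never gets a chance to correct things.

    # Hypothetical page: the server sends no charset (so Requests falls back on
    # ISO 8859-1), but the markup itself declares UTF-8 in a <meta> tag.
    import bs4

    raw = '<meta charset="utf-8"><title>café</title>'.encode('utf-8')
    text = raw.decode('ISO-8859-1')   # this is what r.text would hold
    soup = bs4.BeautifulSoup(text, 'html.parser', from_encoding='ISO-8859-1')
    print(soup.title.string)          # prints 'cafÃ©': mojibake, the <meta> charset was never consulted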

The approaches I currently see from here are as follows:

  1. Use Unicode, Dammit unconditionally when parsing the page. This is the default, but it seems to be ineffective for any of the pages that I’ve been testing it with.
  2. Use ftfy unconditionally before or after parsing the page. I’m not fond of this option, because it basically relies on guesswork for a task for which we (usually) have perfect information.
  3. Write code to look for an appropriate <meta> tag, try to heed any encodings we find there, then fall back on Requests’ .encoding, possibly in combination with the previous option. I find this option ideal, but I’d rather not write this code if it already exists.

TL;DR is there a Right Way™ to make Beautiful Soup correctly heed the character encoding of arbitrary HTML pages on the web, using a similar technique to what browsers use?

Asked Sep 27 '22 by Delan Azabani

2 Answers

It seems you want to prefer encodings declared in documents over those declared in the HTTP headers. UnicodeDammit (used internally by BeautifulSoup) does this the other way around if you just pass it the encoding from the header. You can overcome this by reading declared encodings from the document and passing those to try first. Roughly (untested!):

from bs4.dammit import EncodingDetector  # find_declared_encoding lives here in current bs4

r = requests.get(url, headers={ 'User-Agent': '...' })

content_type = r.headers.get('Content-Type', '')
is_html = content_type.split(';', 1)[0].lower().startswith('text/html')
declared_encoding = EncodingDetector.find_declared_encoding(r.content, is_html=is_html)

# Pass the raw bytes so from_encoding is actually honored; prefer the encoding
# declared in the document, and fall back on the one from the Content-Type header.
soup = bs4.BeautifulSoup(r.content, from_encoding=declared_encoding or r.encoding)

title = soup.title...
Answered Oct 03 '22 by taleinat


Unlike the more general module ftfy, the approach that Unicode, Dammit takes is exactly what I’m looking for (see bs4/dammit.py). It heeds the information provided by any <meta> tags, rather than applying more blind guesswork to the problem.
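As a small illustration (with a made-up snippet of markup, not code taken from either answer), this is what Unicode, Dammit does when it is handed raw bytes:

    from bs4 import UnicodeDammit

    raw = '<meta charset="utf-8"><title>café</title>'.encode('utf-8')
    dammit = UnicodeDammit(raw, is_html=True)
    print(dammit.original_encoding)   # 'utf-8', taken from the <meta> tag
    print(dammit.unicode_markup)      # the correctly decoded markup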

When r.text is used, however, Requests tries to be helpful by automatically decoding the page with the charset from its Content-Type header, falling back on ISO 8859-1 when none is present. Unicode, Dammit does not touch any markup that is already a Unicode string, so by that point the damage is already done!
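As a quick illustration of that behaviour, for a hypothetical response whose Content-Type header carries no charset parameter:

    print(r.headers.get('Content-Type'))  # e.g. 'text/html', no charset parameter
    print(r.encoding)                     # 'ISO-8859-1', the HTTP default, not necessarily the real encoding
    print(type(r.text), type(r.content))  # str vs bytes; only the bytes are untouched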

The solution I chose was to use r.content instead:

  1. r = requests.get(url, headers={ 'User-Agent': '...' })
  2. soup = bs4.BeautifulSoup(r.content)
  3. title = soup.title.string.replace('\n', ' ').replace(...) etc.

The only drawback that I can see is that pages with a charset only in their Content-Type header will be subject to some guesswork by Unicode, Dammit, because passing BeautifulSoup the from_encoding=r.encoding argument would override Unicode, Dammit completely.
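One possible refinement (an untested sketch using bs4's EncodingDetector helper, not something either answer includes): only hand the header charset to Beautiful Soup when the document declares nothing itself, so a <meta> tag still wins whenever it exists and the Content-Type charset merely fills the gap.

    from bs4.dammit import EncodingDetector

    declared = EncodingDetector.find_declared_encoding(r.content, is_html=True)
    soup = bs4.BeautifulSoup(r.content, from_encoding=None if declared else r.encoding)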

Answered Oct 03 '22 by Delan Azabani