
scrape with correct character encoding (python requests + beautifulsoup)

I have an issue parsing this website: http://fm4-archiv.at/files.php?cat=106

It contains special characters such as umlauts. See here: [screenshot: page with umlauts rendered correctly]

My Chrome browser displays the umlauts properly, as you can see in the screenshot above. However, on other pages (e.g. http://fm4-archiv.at/files.php?cat=105) the umlauts are not displayed properly, as can be seen in the screenshot below: [screenshot: page with broken umlauts]

The meta HTML tag defines the following charset on the pages:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>

I use the python requests package to get the HTML and then use Beautifulsoup to scrape the desired data. My code is as follows:

import requests
from bs4 import BeautifulSoup
r = requests.get(URL)
soup = BeautifulSoup(r.content, "lxml")

If I print the encoding (print(r.encoding)), the result is UTF-8. If I manually change the encoding to ISO-8859-1 or cp1252 by setting r.encoding = 'ISO-8859-1', nothing changes when I output the data on the console. This is also my main issue.

r = requests.get(URL)
r.encoding = 'ISO-8859-1'
soup = BeautifulSoup(r.content,"lxml")

still results in the following string shown on the console output in my python IDE:

Der WildlÃ¶wenpfleger

instead it should be

Der Wildlöwenpfleger

How can I change my code to parse the umlauts properly?

beta asked Sep 16 '17 11:09


1 Answer

In general, instead of using r.content, which is the raw byte string received, use r.text, which is the content decoded using the encoding that requests determined (r.encoding).
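To see why the choice matters, here is a minimal offline sketch. It uses a hard-coded byte string in place of a real response, so `raw` is an assumption standing in for r.content:

```python
# Stand-in for r.content: the text encoded as UTF-8 bytes (assumed sample data).
raw = "Der Wildlöwenpfleger".encode("utf-8")

# Decoding with the charset named in the page's <meta> tag produces mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the charset from the HTTP header gives the intended text,
# which is what r.text holds when r.encoding is 'UTF-8'.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Setting r.encoding only affects r.text; r.content stays the same bytes, which is why changing r.encoding made no difference in the question's code. When BeautifulSoup is given bytes, it has to guess the encoding itself, and it may pick up the iso-8859-1 declaration from the page's meta tag.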

In this case requests will use UTF-8 to decode the incoming byte string because this is the encoding reported by the server in the Content-Type header:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://fm4-archiv.at/files.php?cat=106')

>>> type(r.content)    # raw content
<class 'bytes'>
>>> type(r.text)       # decoded to unicode
<class 'str'>
>>> r.headers['Content-Type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'

>>> soup = BeautifulSoup(r.text, 'lxml')

That will fix the "WildlÃ¶wenpfleger" problem. However, other parts of the page then begin to break, for example:

>>> soup = BeautifulSoup(r.text, 'lxml')     # using decoded string... should work
>>> soup.find_all('a')[39]
<a href="details.php?file=1882">Der Wildlöwenpfleger</a>
>>> soup.find_all('a')[10]
<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)

shows that "Wildlöwenpfleger" is fixed but now "übergeben" and others in the second link are broken.

It appears that multiple encodings are used in a single HTML document. The first link uses UTF-8 encoding:

>>> r.content[8013:8070].decode('iso-8859-1')
'<a href="details.php?file=1882">Der WildlÃ¶wenpfleger</a>'

>>> r.content[8013:8070].decode('utf8')
'<a href="details.php?file=1882">Der Wildlöwenpfleger</a>'

but the second link uses ISO-8859-1 encoding:

>>> r.content[2868:3132].decode('iso-8859-1')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon übergeben. Auf Streifzügen durch die Popliteratur stößt Hermes auf deren große Themen und hört mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'

>>> r.content[2868:3132].decode('utf8', 'replace')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'
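The byte offsets in the slices above are specific to that page at the time it was fetched. One way to locate such spots is to let the decoder report where it fails. This is only a sketch; the helper name `first_invalid_utf8_offset` is made up for illustration:

```python
def first_invalid_utf8_offset(raw: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte.
        return exc.start
    return None


# Assumed sample data: ASCII followed by an ISO-8859-1 encoded 'ü'.
sample = b"Salon " + "übergeben".encode("iso-8859-1")
print(first_invalid_utf8_offset(sample))
```

Calling it on r.content and inspecting the bytes around the returned offset shows which parts of the page are not UTF-8.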

Using multiple encodings in the same HTML document is incorrect. Short of contacting the document's author and asking for a correction, there is not much you can easily do to handle the mixed encoding. You could run chardet.detect() over the data piece by piece as you process it, but it's not going to be pleasant.
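If the page cannot be fixed at the source, one possible workaround is to decode it line by line, trying UTF-8 first and falling back to ISO-8859-1 when that fails. The sketch below is an assumption about what might work for this page, not a general solution; `decode_mixed` is a hypothetical helper name:

```python
def decode_mixed(raw: bytes, primary: str = "utf-8",
                 fallback: str = "iso-8859-1") -> str:
    """Decode each line with `primary`, using `fallback` for lines that fail."""
    decoded = []
    for line in raw.splitlines(keepends=True):
        try:
            decoded.append(line.decode(primary))
        except UnicodeDecodeError:
            # ISO-8859-1 maps every byte to a character, so this cannot fail.
            decoded.append(line.decode(fallback))
    return "".join(decoded)


# Assumed sample data: one UTF-8 line followed by one ISO-8859-1 line.
sample = (
    '<a href="details.php?file=1882">Der Wildlöwenpfleger</a>\n'.encode("utf-8")
    + '<a title="Salon übergeben, stößt, große">Salon Hermes</a>\n'.encode("iso-8859-1")
)
print(decode_mixed(sample))
```

The result could then be passed to the parser, e.g. BeautifulSoup(decode_mixed(r.content), 'lxml'). It will still go wrong if a single line mixes both encodings, or if ISO-8859-1 bytes happen to form a valid UTF-8 sequence.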

mhawke answered Sep 28 '22 18:09