
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup? [duplicate]

I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.

However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.

Sample program:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

# Parse with BeautifulSoup
soup = BeautifulSoup(response)

# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])

Running this gives the result:

# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!' 

But I would expect a Python Unicode string to render ö in the word können as \xf6:

# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'

I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and calling read() and decode() on the response object myself, but each attempt either makes no difference or throws an error.

With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains 0xc3 0xb6 for the ö character):

      20 74 69 74 6c 65 3d 22  48 69 65 72 20 6b c3 b6  | title="Hier k..|
      6e 6e 65 6e 20 53 69 65  20 73 69 63 68 20 6b 6f  |nnen Sie sich ko|
      73 74 65 6e 6c 6f 73 20  72 65 67 69 73 74 72 69  |stenlos registri|
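The two byte values from the hexdump can be checked in isolation. The sketch below (Python 3, standard library only) decodes the same bytes twice: once as UTF-8, and once as ISO-8859-2. The second codec is an assumption, picked only because it maps 0xc3 0xb6 to the same \u0102\u015b pair seen in the output above; nothing here establishes which encoding BeautifulSoup actually guessed.

```python
# Bytes copied from the hexdump: "Hier k" + 0xc3 0xb6 + "nnen Sie"
raw = b"Hier k\xc3\xb6nnen Sie"

# Intended interpretation: 0xc3 0xb6 is one character, U+00F6 (ö).
as_utf8 = raw.decode("utf-8")

# Assumed wrong guess, for comparison: each byte becomes its own character.
as_latin2 = raw.decode("iso-8859-2")

print(ascii(as_utf8))    # 'Hier k\xf6nnen Sie'
print(ascii(as_latin2))  # 'Hier k\u0102\u015bnnen Sie'
```

If the second line matches what your program prints, the bytes were fine and the wrong codec was applied somewhere between the response and the parsed tree.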

I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?

asked Nov 25 '13 at 23:11 by Christopher Orr



2 Answers

As justhalf points out above, my question here is essentially a duplicate of this question.

The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 byte sequences.

This apparently confuses BeautifulSoup about which encoding is in use. When I tried decoding the content as UTF-8 myself before passing it to BeautifulSoup, like this:

soup = BeautifulSoup(response.read().decode('utf-8')) 

I would get the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: invalid continuation byte

Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.
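Why one wrong byte is enough to trigger this error: in UTF-8, a lead byte of 0xc3 announces a two-byte sequence, while 0xe3 announces a three-byte sequence, so the decoder expects one more continuation byte than the page provides. A minimal Python 3 sketch of that, using the hypothetical sample word "Über" (the page's actual surrounding text is not shown in this post):

```python
good = b"\xc3\x9cber"  # correct UTF-8 for the sample word "Über"
bad = b"\xe3\x9cber"   # same bytes with the lead byte corrupted to 0xe3

print(good.decode("utf-8"))

error = None
try:
    bad.decode("utf-8")
except UnicodeDecodeError as exc:
    # Keep a reference; Python 3 unbinds `exc` when the except block ends.
    error = exc
    print(exc.reason, "starting at byte offset", exc.start)
```

The `start` attribute is the same offset that appears in the traceback message, which is how the offending bytes can be located in a large response body.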

As the currently highest-rated answer on that question suggests, the invalid UTF-8 bytes can be dropped while decoding, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore')) 
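One thing to keep in mind with 'ignore' is that the broken bytes vanish silently, so the affected character is simply missing from the result. The Python 3 sketch below compares it with the 'replace' handler, which leaves a visible U+FFFD marker instead. The sample bytes are hypothetical, built from the corrupted sequence described above.

```python
# Hypothetical sample: "Über uns" with the lead byte of Ü corrupted to 0xe3.
bad = b"\xe3\x9cber uns"

# 'ignore' drops the undecodable bytes without leaving a trace.
dropped = bad.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD so the damage stays visible in the output.
marked = bad.decode("utf-8", "replace")

print(ascii(dropped))
print(ascii(marked))
```

Which handler fits better depends on whether you would rather have clean-looking text or a marker showing where the source was damaged.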
answered Oct 09 '22 at 12:10 by Christopher Orr


Encoding the result to utf-8 seems to work for me:

print (soup.find('div', id='navbutton_account')['title']).encode('utf-8') 

It yields:

Hier können Sie sich kostenlos registrieren und / oder einloggen! 
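For readers on Python 3, the same idea can be sketched as follows. The string literal is copied from the expected output in the question; everything else is illustrative. Note that encoding only reproduces the page's original bytes if the string was decoded correctly in the first place.

```python
# A correctly decoded title, as the question expects it.
title = "Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!"

# Encoding turns the text back into bytes; ö becomes the pair 0xc3 0xb6.
encoded = title.encode("utf-8")

# Decoding those bytes as UTF-8 gives back the original text.
round_trip = encoded.decode("utf-8")

print(encoded[6:8])
print(round_trip == title)
```

When writing to a file in Python 3, passing `encoding="utf-8"` to `open()` performs this encoding step for you.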
answered Oct 09 '22 at 11:10 by Birei