How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup? [duplicate]

Tags:

python

unicode

beautifulsoup

utf-8

urllib2

I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.

However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.

Sample program:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    # Fetch URL
    url = 'http://www.voxnow.de/'
    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'utf-8')

    # Response has UTF-8 charset header,
    # and HTML body which is UTF-8 encoded
    response = urllib2.urlopen(request)

    # Parse with BeautifulSoup
    soup = BeautifulSoup(response)

    # Print title attribute of a <div> which uses umlauts (e.g. können)
    print repr(soup.find('div', id='navbutton_account')['title'])

Running this gives the result:

# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'

But I would expect a Python Unicode string to render ö in the word können as \xf6:

# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'

I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and I've tried calling read() and decode() on the response object, but this either makes no difference or throws an error.

With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains 0xc3 0xb6 for the ö character):

    20 74 69 74 6c 65 3d 22  48 69 65 72 20 6b c3 b6  | title="Hier k..|
    6e 6e 65 6e 20 53 69 65  20 73 69 63 68 20 6b 6f  |nnen Sie sich ko|
    73 74 65 6e 6c 6f 73 20  72 65 67 69 73 74 72 69  |stenlos registri|
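
For reference, the two bytes c3 b6 from the dump can be decoded by hand to compare both outcomes. This is only a sketch, written for Python 3 rather than the Python 2 used in the sample program. The variable names are illustrative, and the choice of ISO-8859-2 is an assumption: it happens to reproduce the observed output, but nothing here confirms that BeautifulSoup actually picked that codec.

```python
# The bytes for "können" as they appear in the hexdump (6b c3 b6 6e ...).
raw = b'k\xc3\xb6nnen'

# Decoded as UTF-8, the pair c3 b6 becomes the single code point U+00F6.
as_utf8 = raw.decode('utf-8')

# Decoded as ISO-8859-2 (an assumed codec, see above), each byte becomes
# its own character: 0xc3 maps to U+0102 and 0xb6 maps to U+015B.
as_latin2 = raw.decode('iso-8859-2')

# ascii() escapes non-ASCII characters, so the output is console-safe.
print(ascii(as_utf8))
print(ascii(as_latin2))
```

The second result matches the \u0102\u015b seen in the program output, which suggests the bytes themselves are fine and only the codec applied to them is wrong.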

I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?

713

asked Nov 25 '13 23:11

Christopher Orr

2 Answers

As justhalf points out above, my question here is essentially a duplicate of this question.

The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue byte sequences that are not valid UTF-8.

This apparently confuses BeautifulSoup about which encoding is in use. When I tried to decode the content as UTF-8 myself before passing it to BeautifulSoup, like this:

soup = BeautifulSoup(response.read().decode('utf-8'))

I would get the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: invalid continuation byte

Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.
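
The same failure can be reproduced in isolation. A minimal sketch, assuming Python 3 and using hand-written byte strings rather than the actual page content (the sample word and the variable names are illustrative):

```python
# Correct UTF-8 encoding of U+00DC (Ü), followed by ASCII letters.
good = b'\xc3\x9cber'
# The same text with the lead byte corrupted to 0xe3, as described above.
bad = b'\xe3\x9cber'

# ascii() escapes non-ASCII characters, so the output is console-safe.
print(ascii(good.decode('utf-8')))

error = None
try:
    bad.decode('utf-8')
except UnicodeDecodeError as exc:
    # 0xe3 announces a three-byte sequence, but the third byte (the
    # ASCII letter "b") is not a continuation byte, so strict decoding
    # stops here.
    error = exc

if error is not None:
    print(error.reason)
    print(error.start, error.end)
```

The start and end attributes of the exception give the offsets of the offending bytes, which is how a position such as 186812-186813 can be traced back to the raw response.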

As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while parsing, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
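
Note that 'ignore' silently drops whatever cannot be decoded, so the damaged character disappears from the text. A small sketch of how 'ignore' compares with the 'replace' error handler, assuming Python 3 and an illustrative byte string rather than the real page:

```python
# Valid UTF-8 for "können", then a corrupted Ü (0xe3 0x9c instead of
# 0xc3 0x9c) in front of "ber".
raw = b'k\xc3\xb6nnen \xe3\x9cber'

# 'ignore' discards the undecodable bytes without leaving a trace.
ignored = raw.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD (the replacement character) instead, so
# the damaged spot stays visible in the result.
replaced = raw.decode('utf-8', 'replace')

# ascii() escapes non-ASCII characters, so the output is console-safe.
print(ascii(ignored))
print(ascii(replaced))
```

Either handler lets the rest of the page decode as UTF-8. 'replace' may be preferable when it matters to know that some input was lost.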

156

answered Oct 09 '22 12:10

Christopher Orr

Encoding the result to utf-8 seems to work for me:

print (soup.find('div', id='navbutton_account')['title']).encode('utf-8')

It yields:

Hier kÃ¶nnen Sie sich kostenlos registrieren und / oder einloggen!

answered Oct 09 '22 11:10

Birei
