
Beautiful Soup Unicode encode error

I am running the following code on a particular HTML file:

from BeautifulSoup import BeautifulSoup
import re
import codecs
import sys
f = open('test1.html')
html = f.read()
soup = BeautifulSoup(html)
body = soup.body.contents
para = soup.findAll('p')
print str(para).encode('utf-8')

I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 9: ordinal not in range(128)

How do I debug this?

I do not get any error when I remove the print statement.

Rohit Banga asked Apr 13 '10 04:04




1 Answer

The str(para) call tries to convert the unicode text in para with the default (ascii) codec. This happens before encode() is ever called, so the 'utf-8' argument never comes into play:

>>> s=u'123\u2019'
>>> str(s)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> s.encode("utf-8")
'123\xe2\x80\x99'
>>> 

Skip str() and encode the unicode text directly, for example by applying encode("utf-8") to the text of each list element.
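
A minimal sketch of that suggestion, assuming the paragraph text is already available as unicode strings. The `paragraphs` list is made-up stand-in data, not output from the asker's test1.html, and the snippet is written so that it should run on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
# Stand-in data (hypothetical): unicode text such as findAll('p') might yield.
paragraphs = [u"It\u2019s here", u"plain ascii"]

# Encoding to ASCII fails on U+2019, the same failure that the implicit
# str() conversion triggers in Python 2.
try:
    paragraphs[0].encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encode each element explicitly as UTF-8 instead of calling str() on the list.
encoded = [p.encode("utf-8") for p in paragraphs]

# Join the byte strings; b"\n" keeps this valid on both Python 2 and 3.
output = b"\n".join(encoded)
```

On Python 2, `print output` would then write the UTF-8 bytes without going through the ascii codec. With real Tag objects, each element would first need converting to unicode text (for example with unicode(tag), which is an assumption about the BeautifulSoup version in use) before encode("utf-8") is applied.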

gimel answered Sep 28 '22 18:09