Beautiful Soup Unicode encode error

Tags: python, unicode, beautifulsoup

I am trying the following code with a particular HTML file:

from BeautifulSoup import BeautifulSoup
import re
import codecs
import sys
f = open('test1.html')
html = f.read()
soup = BeautifulSoup(html)
body = soup.body.contents
para = soup.findAll('p')
print str(para).encode('utf-8')

I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 9: ordinal not in range(128)

How do I debug this?

I do not get any error when I remove the print statement.

asked Apr 13 '10 04:04

Rohit Banga

1 Answer

The str() builtin tries to convert the unicode in para to a byte string using the default (ascii) codec. That conversion runs before encode() is ever called, so the error is raised there:

>>> s=u'123\u2019'
>>> str(s)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> s.encode("utf-8")
'123\xe2\x80\x99'
>>>

Instead of calling str() first, try encoding para directly, for example by applying encode("utf-8") to each list element.
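
A minimal sketch of that suggestion, assuming each element can be treated as a unicode string. The paragraphs list and the encode_each helper are hypothetical stand-ins for illustration; they are not part of BeautifulSoup or of the code in the question. The explicit encode('ascii') call reproduces the conversion that str() applies implicitly to unicode in Python 2.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-ins for the text of each <p> element; these values
# are made up for illustration and do not come from test1.html.
paragraphs = [u'It\u2019s a test', u'plain ascii']


def encode_each(items, encoding='utf-8'):
    """Encode every unicode string in items separately (hypothetical helper)."""
    return [item.encode(encoding) for item in items]


# Explicit form of the ASCII conversion that str() performs implicitly on
# unicode in Python 2: it fails on the right single quotation mark U+2019.
try:
    paragraphs[0].encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding each element as UTF-8 succeeds and yields byte strings.
encoded = encode_each(paragraphs)
```

Here ascii_failed ends up True, while encoded holds UTF-8 byte strings that can be written to a file or terminal without going through the ascii codec.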

answered Sep 28 '22 18:09

gimel
