I'm learning about urllib2 and Beautiful Soup, and in my first tests I am getting errors like:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)
There seem to be lots of posts about this type of error, and I have tried the solutions that I can understand, but there seem to be catch-22s with them, e.g.:
I want to print post.text
(where text is a beautiful soup method that just returns the text).
str(post.text)
and post.text
produce the Unicode errors (on characters such as the right apostrophe ’ and the ellipsis …).
So I add post = unicode(post)
above str(post.text)
, then I get:
AttributeError: 'unicode' object has no attribute 'text'
I also tried (post.text).encode()
and (post.text).renderContents()
.
The latter producing the error:
AttributeError: 'unicode' object has no attribute 'renderContents'
and then I tried str(post.text).renderContents()
and got the error:
AttributeError: 'str' object has no attribute 'renderContents'
It would be great if I could just declare somewhere at the top of the document "make this content interpretable"
and still have access to the required text
function.
Update: after suggestions:
If I add post = post.decode("utf-8")
above str(post.text)
I get:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I add post = post.decode()
above str(post.text)
I get:
AttributeError: 'unicode' object has no attribute 'text'
If I add post = post.encode("utf-8")
above (post.text)
I get:
AttributeError: 'str' object has no attribute 'text'
I tried print post.text.encode('utf-8')
and got:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)
And for the sake of trying things that might work, I installed lxml for Windows from here and implemented it with:
parsed_content = BeautifulSoup(original_content, "lxml")
according to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.
These steps didn't seem to make a difference.
I'm using Python 2.7.4 and Beautiful Soup 4.
Solution:
After getting a deeper understanding of Unicode, UTF-8 and the Beautiful Soup types, I found the problem was in how I was printing. I removed all my str
calls and concatenations, e.g. str(something) + post.text + str(something_else)
, so that the print statement became something, post.text, something_else
, and it now seems to print well, except that I have less control over the formatting at this stage (e.g. print inserts a space at each ,
).
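One way to get the formatting control back, without going back to str(), is to build a single unicode string and encode it once at the end. This is only a sketch: label, post_text and count are hypothetical stand-ins (post_text plays the role of post.text), and the u"" prefix is used so the same lines run on Python 2.7 and on Python 3.3+.

```python
# Hypothetical stand-ins for the values being printed; post_text plays the
# role of post.text and contains a right apostrophe and an ellipsis.
label = u"Post"
post_text = u"It\u2019s a long story\u2026"
count = 3

# Format into a unicode template instead of wrapping the pieces in str().
# On Python 2, str() on a unicode value attempts an implicit ASCII encode,
# which is the step that raises UnicodeEncodeError.
line = u"{0}: {1} ({2} replies)".format(label, post_text, count)

# Encode once, at the point where the text leaves the program.
encoded = line.encode("utf-8")
```

The template decides exactly where spaces and punctuation go, so nothing is inserted automatically between the parts.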
Many encodings map only a limited set of Unicode characters to bytes. Any character that the target encoding cannot represent causes the encoding to fail and raise UnicodeEncodeError. To avoid this error, call encode('utf-8') and decode('utf-8') explicitly at the points where your code converts between unicode and str.
The UnicodeEncodeError normally happens when encoding a unicode string with a particular codec. Since a codec such as ASCII maps only a limited number of Unicode characters to str byte strings, a character it does not cover causes that codec's encode() to fail.
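A small illustration of that failure, using the same character (U+2026) that appears in the traceback from the question. The variable names are made up for the example, and the code runs on both Python 2.7 and Python 3.

```python
# Contains U+2026 (horizontal ellipsis), the character from the traceback.
text = u"Read more\u2026"

# ASCII has no mapping for U+2026, so encoding raises UnicodeEncodeError.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent the character, so this succeeds.
utf8_bytes = text.encode("utf-8")

# Decoding with the same codec restores the original text.
round_trip = utf8_bytes.decode("utf-8")
```

The first byte of the encoded ellipsis is 0xe2, which is the byte named in the UnicodeDecodeError quoted in the question.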
In Python 2, when the encoding of stdout is ASCII or cannot be determined (for example, when output is piped), unicode
objects can only be printed if they can be converted to ASCII. If the text can't be encoded in ASCII, you'll get that error. You probably want to explicitly encode it and then print the resulting str
:
print post.text.encode('utf-8')
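If the terminal cannot display UTF-8 (some Windows consoles cannot), the second argument to encode() offers a few fallbacks. This is a sketch with a made-up sample string, not a recommendation for any particular handler.

```python
# Sample text with a right apostrophe (U+2019) and an ellipsis (U+2026).
text = u"It\u2019s here\u2026"

# Substitute "?" for each character the target codec cannot represent.
replaced = text.encode("ascii", "replace")

# Drop unrepresentable characters entirely.
ignored = text.encode("ascii", "ignore")

# Keep the information as numeric character references (handy for HTML).
as_refs = text.encode("ascii", "xmlcharrefreplace")
```

All three lose or rewrite the original characters, so they suit display or logging better than storage.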
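# Python 3; THE_URL is a placeholder for the page address
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen(THE_URL).read()
soup = BeautifulSoup(html, "html.parser")
print("'" + str(soup.encode("ascii")) + "'")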
worked for me ;-)