
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

I'm learning about urllib2 and Beautiful Soup, and in my first tests I'm getting errors like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)

There seem to be lots of posts about this type of error, and I have tried the solutions I can understand, but each one seems to lead to a catch-22, e.g.:

I want to print post.text (where text is a Beautiful Soup attribute that returns just the text). Both str(post.text) and post.text produce the unicode errors (on characters like the right apostrophe ’ and the ellipsis …).

So I add post = unicode(post) above str(post.text), then I get:

AttributeError: 'unicode' object has no attribute 'text'

I also tried (post.text).encode() and (post.text).renderContents(). The latter produced the error:

AttributeError: 'unicode' object has no attribute 'renderContents'

and then I tried str(post.text).renderContents() and got the error:

AttributeError: 'str' object has no attribute 'renderContents'

It would be great if I could just declare once, somewhere at the top of the script, "make this content printable" and still have access to the text attribute I need.
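For the "declare it once at the top" idea, one common approach is to wrap the output stream in a writer that encodes everything to UTF-8. The sketch below is an illustration, not a confirmed fix for this particular setup: it writes to an in-memory io.BytesIO so the result can be inspected, make_utf8_writer is a hypothetical helper name, and it assumes the destination can accept UTF-8 bytes.

```python
import codecs
import io


def make_utf8_writer(byte_stream):
    """Wrap a byte stream so that text written to it is encoded as UTF-8.

    Hypothetical helper for illustration. In a Python 2 script the same
    wrapper is often applied to sys.stdout once, near the top of the file.
    """
    return codecs.getwriter("utf-8")(byte_stream)


buffer = io.BytesIO()           # stands in for the real output stream
out = make_utf8_writer(buffer)
out.write(u"Read more\u2026\n")  # U+2026 is the ellipsis from the error message
written = buffer.getvalue()
```

The object holding the text stays untouched, so attributes such as post.text remain available; only the writing step changes.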


Update: after suggestions:

If I add post = post.decode("utf-8") above str(post.text) I get:

TypeError: unsupported operand type(s) for -: 'str' and 'int'  

If I add post = post.decode() above str(post.text) I get:

AttributeError: 'unicode' object has no attribute 'text'

If I add post = post.encode("utf-8") above (post.text) I get:

AttributeError: 'str' object has no attribute 'text'

I tried print post.text.encode('utf-8') and got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)

And for the sake of trying things that might work, I installed lxml for Windows from here and implemented it with:

parsed_content = BeautifulSoup(original_content, "lxml")

according to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.

These steps didn't seem to make a difference.

I'm using Python 2.7.4 and Beautiful Soup 4.


Solution:

After getting a deeper understanding of unicode, UTF-8 and the Beautiful Soup types, I found that the problem was in how I was printing. I removed all my str() calls and concatenations, e.g. str(something) + post.text + str(something_else), and printed something, post.text, something_else instead. It now seems to print well, although I have less control over the formatting at this stage (e.g. print inserts a space at each comma).
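One way to get the formatting control back without the str() calls, sketched under the assumption that post.text is a unicode string: build the whole line as unicode first, then encode it once when it is written out. The function name format_post, its arguments and the layout are hypothetical.

```python
def format_post(index, text, author):
    """Build the complete line as text; no str() calls, no byte strings mixed in."""
    # A unicode format string keeps the result unicode, so non-ASCII
    # characters in `text` never pass through an implicit ASCII conversion.
    return u"{0}: {1} (by {2})".format(index, text, author)


line = format_post(1, u"Read more\u2026", u"user")

# Encode exactly once, at the point of output.
encoded = line.encode("utf-8")
```

Spacing and separators are now decided by the format string rather than by print's commas.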

user1063287 asked Apr 27 '13 11:04



2 Answers

In Python 2, printing a unicode object encodes it implicitly. When the output encoding is ASCII (or can't be detected), any character outside ASCII raises that error, and str() on a unicode object always uses ASCII. You probably want to encode it explicitly and then print the resulting str:

print post.text.encode('utf-8')
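
As a rough illustration of what the explicit encode changes, here is a sketch that uses a plain unicode literal in place of post.text (an assumption, so that it runs without Beautiful Soup). It reproduces the ASCII failure and shows a few alternatives:

```python
text = u"Read more\u2026"  # stand-in for post.text; U+2026 is the ellipsis

# Encoding to ASCII fails for any character above 127.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    failure = str(exc)

# UTF-8 can represent every character, so this always succeeds.
utf8_bytes = text.encode("utf-8")

# If the output really must be ASCII, an error handler decides what
# happens to the characters that don't fit.
ascii_lossy = text.encode("ascii", "replace")            # becomes "?"
ascii_refs = text.encode("ascii", "xmlcharrefreplace")   # becomes "&#8230;"
```

Whether the UTF-8 bytes display correctly still depends on the terminal or file that receives them.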
icktoofay answered Oct 06 '22 14:10


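    import urllib.request
    from bs4 import BeautifulSoup
    html = urllib.request.urlopen(THE_URL).read()  # THE_URL is the page to fetch
    soup = BeautifulSoup(html, "html.parser")
    # Non-ASCII characters become character references, so the output is pure ASCII.
    print("'" + soup.encode("ascii").decode("ascii") + "'")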

worked for me ;-)

Patpog answered Oct 06 '22 16:10