
UnicodeEncodeError when fetching url

I am trying to get all the text nodes in an HTML document using lxml, but I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128). However, when I check the encoding of the page (encoding = chardet.detect(response)['encoding']), it reports utf-8. It seems odd that a single page would involve both utf-8 and ascii. Actually, this:

fromstring(response).text_content().encode('ascii', 'replace')

solves the problem.

Here is my code:

from lxml.html import fromstring
import urllib2
import chardet

request = urllib2.Request(my_url)
request.add_header('User-Agent',
                   'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
request.add_header("Accept-Language", "en-us")
response = urllib2.urlopen(request).read()

# Detect the encoding of the raw bytes (the chardet call mentioned above)
encoding = chardet.detect(response)['encoding']

print encoding
print fromstring(response).text_content()

Output:

utf-8
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128)

What can I do to solve this issue? Keep in mind that I want to do this with a few other pages, so I don't want to handle the encoding on a page-by-page basis.

UPDATE:

Maybe there is something else going on here. When I run this script in the terminal, I get the correct output, but when I run it inside SublimeText, I get the UnicodeEncodeError.

UPDATE2:

It also happens when I write this output to a file. .encode('ascii', 'replace') works, but I'd like a more general solution.

Regards

Robert Smith asked Jun 16 '12 00:06



2 Answers

Can you try wrapping your string with repr()? This article might help.

print repr(fromstring(response).text_content())
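
To illustrate why repr() avoids the error: it returns an ASCII-only representation with backslash escapes, so print has nothing left to encode. The sketch below is written for Python 3, where ascii() plays the role that repr() plays for unicode strings in Python 2. The sample string is a hypothetical stand-in for the text_content() result.

```python
# Hypothetical sample standing in for fromstring(response).text_content()
text = "caf\xe9"

# ascii() in Python 3 behaves like repr() on a Python 2 unicode string:
# non-ASCII characters are replaced by backslash escapes.
escaped = ascii(text)

# The escaped form contains only ASCII, so encoding it cannot fail.
safe_bytes = escaped.encode("ascii")
print(escaped)
```

This is mostly useful for debugging, since the output shows the escapes rather than the real characters.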
ChipJust answered Oct 01 '22 14:10



As for writing to a file, as mentioned in your edit, I would recommend opening the file with the codecs module:

import codecs
output_file = codecs.open('filename.txt','w','utf8')
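
A slightly fuller sketch of the same idea, assuming a hypothetical sample string and a temporary path rather than your real output. It writes the text through codecs.open and reads it back to check that nothing was lost:

```python
import codecs
import os
import tempfile

# Hypothetical sample standing in for the text extracted from the page
text = u"caf\xe9"

# Temporary location chosen only for this illustration
path = os.path.join(tempfile.mkdtemp(), "filename.txt")

# codecs.open encodes unicode text to UTF-8 on write...
with codecs.open(path, "w", "utf8") as output_file:
    output_file.write(text)

# ...and decodes it back to unicode on read.
with codecs.open(path, "r", "utf8") as input_file:
    round_trip = input_file.read()

# Reading in binary mode shows the bytes actually stored on disk.
with open(path, "rb") as raw_file:
    raw_bytes = raw_file.read()
```

Because the file object does the encoding, the same code should work for any page without a per-page .encode() call.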

I don't know SublimeText, but it seems to be trying to read your output as ASCII, hence the encoding error.
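
To show what I mean, here is a minimal sketch that reproduces the behaviour without any editor involved. It uses in-memory streams as stand-ins for stdout (an assumption made for illustration; try_write is a hypothetical helper), one configured as ASCII and one as UTF-8:

```python
import io

# Hypothetical sample containing the same character as in the traceback
text = u"caf\xe9"

def try_write(encoding):
    """Write text to an in-memory text stream that uses the given encoding."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as error:
        # Same failure as the one reported in the question
        return "failed: " + error.reason
    return raw.getvalue()

ascii_result = try_write("ascii")   # the stream cannot represent u'\xe9'
utf8_result = try_write("utf-8")    # the stream encodes it as two bytes
```

If the editor runs the script with an output stream that falls back to ASCII, setting the PYTHONIOENCODING environment variable to utf-8 for that run is one possible workaround, though I have not verified it in SublimeText.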

Justin.Wood answered Oct 01 '22 13:10
