I have this issue trying to get all the text nodes in an HTML document using lxml but I get an UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128)
. However, when I try to find out the type of encoding of this page (encoding = chardet.detect(response)['encoding']
), it says it's utf-8
. It seems weird that a single page has utf-8 and ascii. Actually, this:
fromstring(response).text_content().encode('ascii', 'replace')
solves the problem.
Here it's my code:
from lxml.html import fromstring
import urllib2
import chardet
request = urllib2.Request(my_url)
request.add_header('User-Agent',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
request.add_header("Accept-Language", "en-us")
response = urllib2.urlopen(request).read()
print encoding
print fromstring(response).text_content()
Output:
utf-8
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128)
What can I do to solve this issue?. Keep in mind that I want to do this with a few other pages, so I don't want to encode on an individual basis.
UPDATE:
Maybe there is something else going on here. When I run this script on the terminal, I get a correct output but when a run it inside SublimeText, I get UnicodeEncodeError... ¿?
UPDATE2:
It's also happening when I create a file with this output. .encode('ascii', 'replace')
is working but I'd like to have a more general solution.
Regards
Can you try wrapping your string with repr()? This article might help.
print repr(fromstring(response).text_content())
As far as writing out to a file as said in your edit, I would recommend opening the file with the codecs module:
import codecs
output_file = codecs.open('filename.txt','w','utf8')
I don't know SublimeText, but it seems to be trying to read your output as ASCII, hence the encoding error.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With