lxml not parsing unicode properly for HTML

I am trying to parse HTML with lxml, but the text I get back is garbled instead of the actual title:

node = lxml.html.fromstring(r.content)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']

# @@#### DÃ©mineurs

What do I need to do to correctly parse this text? Here is the web page: https://play.google.com/store/movies/details/D%C3%A9mineurs?id=KChu8wf5eVo&hl=fr and the text should be Démineurs.

David542 asked Mar 15 '15 04:03


2 Answers

The document has no encoding information, so you need to create a parser that uses the correct encoding by default:

>>> lxml.html.fromstring('<p>é</p>').text
u'\xc3\xa9'
>>> hp = lxml.etree.HTMLParser(encoding='utf-8')
>>> lxml.html.fromstring('<p>é</p>', parser=hp).text
u'\xe9'
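The u'\xc3\xa9' result is consistent with the two UTF-8 bytes of é each being read as a separate Latin-1 character. A minimal sketch of that effect, assuming Python 3 and using only the standard library (no lxml involved; the variable names are illustrative):

```python
# The title as it should appear ("Démineurs"), written with an escape so the
# example does not depend on the source file's encoding.
title = "D\u00e9mineurs"

# The UTF-8 bytes a server would send for that title.
raw = title.encode("utf-8")

# Reading those bytes as Latin-1 splits the two-byte sequence for "é"
# (0xC3 0xA9) into two characters.
misread = raw.decode("latin-1")

# Decoding with the encoding the bytes were written in restores the text.
correct = raw.decode("utf-8")

print(ascii(misread))  # 'D\xc3\xa9mineurs'
print(ascii(correct))  # 'D\xe9mineurs'
```

The escaped output mirrors the u'\xc3\xa9' and u'\xe9' values in the session above.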
Ignacio Vazquez-Abrams answered Nov 14 '22 23:11


It's just an encoding issue.

It looks like you're using requests, which is good, because it does this work for you.

First, requests works out the encoding from the response headers and exposes it as r.encoding. For that page, it reports utf-8.
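The value in r.encoding comes from the charset declared in the Content-Type response header. As a rough illustration of that lookup, here is a standard-library sketch in Python 3. It is not requests' own code: the helper name charset_from_content_type and the header values are made up for the example.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the charset and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the header names no charset, there is nothing reliable to read, which is why it is worth checking r.encoding before trusting the decoded text.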

You could do:

data = r.content.decode('UTF-8')
# or
data = r.content.decode(r.encoding)
# then
node = lxml.html.fromstring(data)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']

which works:

@@#### Démineurs

Better yet, just use r.text, which holds the response body already decoded with r.encoding.

node = lxml.html.fromstring(r.text)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']

This works too:

@@#### Démineurs
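Both variants come down to the same step: turning the raw bytes into text before lxml sees them. As a rough sketch of that relationship (Python 3, standard library only), the hypothetical FakeResponse class below mimics just the content, encoding, and text attributes. It is an illustration under that assumption, not the real requests implementation, and the markup is made up.

```python
class FakeResponse:
    """Hypothetical stand-in for a requests response (illustration only)."""

    def __init__(self, content, encoding):
        self.content = content    # raw body as bytes
        self.encoding = encoding  # charset name, e.g. "utf-8"

    @property
    def text(self):
        # Assumption: text is the raw bytes decoded with the reported encoding.
        return self.content.decode(self.encoding, errors="replace")


# Made-up markup standing in for the downloaded page ("Démineurs").
r = FakeResponse("<div>D\u00e9mineurs</div>".encode("utf-8"), "utf-8")

# Decoding by hand and reading .text give the same string.
print(r.content.decode(r.encoding) == r.text)
print(ascii(r.text))
```

Either string can then be handed to lxml.html.fromstring; passing text rather than bytes means the parser no longer has to work out the encoding itself.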
jedwards answered Nov 15 '22 00:11