lxml not parsing unicode properly for HTML

Question

I am trying to parse HTML, but unfortunately lxml is not allowing me to grab the actual text:

node = lxml.html.fromstring(r.content)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']

# @@#### DÃ©mineurs

What do I need to do to correctly parse this text? Here is the web page: https://play.google.com/store/movies/details/D%C3%A9mineurs?id=KChu8wf5eVo&hl=fr and the text should be Démineurs.

Ignacio Vazquez-Abrams · Accepted Answer

The document has no encoding information, so lxml cannot tell that the bytes are UTF-8. You need to create a parser that uses the correct encoding by default.

>>> import lxml.etree, lxml.html
>>> lxml.html.fromstring('<p>é</p>').text
u'\xc3\xa9'
>>> hp = lxml.etree.HTMLParser(encoding='utf-8')
>>> lxml.html.fromstring('<p>é</p>', parser=hp).text
u'\xe9'
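To see where the garbled title in the question comes from, here is a minimal sketch that uses only the standard library (no lxml). It assumes the page bytes are UTF-8 and that they were read as a single-byte encoding such as Latin-1, which is consistent with the u'\xc3\xa9' output above. The variable names are illustrative only.

```python
# Python 3 sketch: the same bytes decoded two different ways.
text = "Démineurs"

# UTF-8 stores "é" as two bytes, 0xC3 0xA9.
raw = text.encode("utf-8")

# Reading those bytes one byte per character (Latin-1) yields "Ã" and "©".
wrong = raw.decode("latin-1")

# Reading them as UTF-8 recovers the original text.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

The first value is the broken title from the question. The second is what a parser configured for UTF-8 gives back.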

jedwards · Answer

It's just an encoding issue.

It looks like you're using requests, which is good, because it does this work for you.

First, requests makes an educated guess at the encoding from the HTTP response headers, which you can access with r.encoding. For that page, requests guessed utf-8.
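As a rough illustration of what guessing from the headers means, the sketch below pulls the charset out of a Content-Type value with the standard library. The function charset_from_content_type is hypothetical and is not part of requests. The ISO-8859-1 fallback is an assumption that mirrors the traditional HTTP default for text responses with no declared charset.

```python
from email.message import Message


def charset_from_content_type(content_type, default="ISO-8859-1"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param returns None when the header has no charset parameter.
    charset = msg.get_param("charset")
    return charset if charset else default


print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html"))
```

If a server omits the charset, a header-based guess can be wrong, which is one reason to check r.encoding before trusting the decoded text.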

You could decode the bytes yourself:

data = r.content.decode('UTF-8')
# or
data = r.content.decode(r.encoding)
# then
node = lxml.html.fromstring(data)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']

which works:

@@#### Démineurs

But better yet, just use r.text, which holds the response body already decoded with r.encoding.

node = lxml.html.fromstring(r.text)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']

This also works:

@@#### Démineurs

Tags: python, unicode, lxml

David542
