I am trying to parse HTML, but unfortunately lxml is not allowing me to grab the actual text:
node = lxml.html.fromstring(r.content)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']
# @@#### DÃ©mineurs
What do I need to do to correctly parse this text? Here is the web page: https://play.google.com/store/movies/details/D%C3%A9mineurs?id=KChu8wf5eVo&hl=fr and the text should be Démineurs.
The document has no encoding information, therefore you need to create a parser that uses the correct encoding by default.
>>> lxml.html.fromstring('<p>é</p>').text
u'\xc3\xa9'
>>> hp = lxml.etree.HTMLParser(encoding='utf-8')
>>> lxml.html.fromstring('<p>é</p>', parser=hp).text
u'\xe9'
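The u'\xc3\xa9' result in the session above is consistent with the two UTF-8 bytes of é each being read as a separate Latin-1 character. Here is a minimal sketch of that mechanism using only the standard library. It is written for Python 3 (the snippets in this thread are Python 2), and the variable names are purely illustrative:

```python
# Python 3 sketch: how UTF-8 bytes turn into mojibake when read as Latin-1.
title = "Démineurs"

# The bytes as a server would send them, encoded as UTF-8.
raw = title.encode("utf-8")

# Decoding with the wrong codec: each byte of the two-byte "é" becomes
# its own character.
wrong = raw.decode("latin-1")

# Decoding with the codec the bytes were actually written in.
right = raw.decode("utf-8")

# Already-garbled text can be repaired by reversing the wrong step,
# provided nothing was lost along the way.
repaired = wrong.encode("latin-1").decode("utf-8")

print(repr(wrong))
print(repr(right))
print(repr(repaired))
```

Passing an HTMLParser with an explicit encoding, as shown above, avoids the wrong decoding step in the first place.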
It's just an encoding issue.
It looks like you're using requests, which is good, because it does this work for you.
First, requests guesses the encoding, which you can access with r.encoding. For that page, requests guessed utf-8.
You could do:
data = r.content.decode('UTF-8')
# or
data = r.content.decode(r.encoding)
# then
node = lxml.html.fromstring(data)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']
which works:
@@#### Démineurs
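One caveat with r.content.decode(r.encoding): as far as I know, r.encoding can be None when the response headers give no usable hint, and decode(None) raises a TypeError. A defensive version might look like the sketch below. It is Python 3, decode_body is a hypothetical helper (not part of requests or lxml), and the UTF-8 fallback is an assumption that happens to suit this page:

```python
# Python 3 sketch of a defensive decode step. decode_body is a hypothetical
# helper, not something provided by requests or lxml.
def decode_body(raw, declared=None, fallback="utf-8"):
    """Decode raw bytes using the declared encoding, else the fallback."""
    encoding = declared or fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: use the fallback and
        # substitute U+FFFD for anything that still cannot be decoded.
        return raw.decode(fallback, errors="replace")


body = "<p>Démineurs</p>".encode("utf-8")
print(decode_body(body, "utf-8"))
print(decode_body(body, None))
print(decode_body(body, "no-such-codec"))
```

With requests you would call it as decode_body(r.content, r.encoding) and pass the result to lxml.html.fromstring.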
But better yet, just use the r.text attribute, which holds the response body already decoded to unicode.
node = lxml.html.fromstring(r.text)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']
which also works:
@@#### Démineurs