
lxml truncates text that contains a 'less than' character

>>> s = '<div> < 20 </div>'
>>> import lxml.html
>>> tree = lxml.html.fromstring(s)
>>> lxml.etree.tostring(tree)
'<div> </div>'

Does anybody know any workaround for this?

Viacheslav asked Jan 05 '13 10:01


2 Answers

Your HTML input is broken; that < left angle bracket should have been encoded to &lt; instead. From the lxml documentation on parsing broken HTML:

The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems.

In other words, you take what you can get from such documents; the way lxml handles broken HTML is not otherwise configurable.

One thing you could try is a different HTML parser. Try BeautifulSoup instead; its broken-HTML handling may give you a different version of that document that does contain what you want. BeautifulSoup can use different parser backends, including lxml and html5lib, so it gives you more flexibility.

The html5lib parser does give you the < character (converted to a &lt; escape):

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<div> < 20 </div>", "html5lib")
<html><head></head><body><div> &lt; 20 </div></body></html>
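
If switching parsers is not an option, another workaround is to pre-escape stray < characters before handing the string to lxml. The following is only a rough sketch of that idea: the helper name escape_stray_lt and the regex heuristic are my own assumptions, not part of lxml or BeautifulSoup. It treats a < as "stray" when it is not followed by a letter, /, ! or ?, so something like a<b would still be left untouched.

```python
import re

# Heuristic (an assumption, not a library feature): a '<' that does not
# start a tag, closing tag, comment/doctype or processing instruction
# is treated as literal text and escaped.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup):
    """Return markup with stray '<' characters replaced by '&lt;'."""
    return _STRAY_LT.sub("&lt;", markup)


fixed = escape_stray_lt("<div> < 20 </div>")
print(fixed)
```

The cleaned string can then be passed to lxml.html.fromstring() as usual. This is a best-effort fix for input you do not control; it will miss ambiguous cases, so test it against your real documents.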
Martijn Pieters answered Sep 27 '22 21:09


Your < should actually be &lt;, since < is a reserved character in HTML. Once it's escaped, it should work.
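
If you are the one generating the HTML, the escaping can be done with the standard library before the text goes into the markup. A minimal sketch, assuming the text and the surrounding tag are assembled by hand (the helper name wrap_in_div is made up for illustration):

```python
import html


def wrap_in_div(text):
    """Escape text for HTML and wrap it in a <div> (hypothetical helper)."""
    # html.escape turns <, > and & (and quotes, by default) into entities.
    return "<div>" + html.escape(text) + "</div>"


markup = wrap_in_div(" < 20 ")
print(markup)

# html.unescape reverses the escaping when you need the raw text back.
original = html.unescape(" &lt; 20 ")
```

Markup built this way no longer contains a bare <, so the parser should not have to guess what was meant.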

Volatility answered Sep 27 '22 23:09