lxml truncates text that contains 'less than' character

Question

>>> s = '<div> < 20 </div>'
>>> import lxml.html
>>> tree = lxml.html.fromstring(s)
>>> lxml.etree.tostring(tree)
'<div> </div>'

Does anybody know any workaround for this?

Martijn Pieters · Accepted Answer

Your HTML input is broken; that < left angle bracket should have been encoded as &lt; instead. From the lxml documentation on parsing broken HTML:

The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems.

In other words, you take what you can get from such documents; the way lxml handles broken HTML is not otherwise configurable.

One thing you could try is a different HTML parser. Try BeautifulSoup instead; its broken-HTML handling may be able to give you a version of the document that preserves what you need. BeautifulSoup can use different parser backends, including lxml and html5lib, so it gives you more flexibility.

The html5lib parser does give you the < character (serialized as an &lt; escape):

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<div> < 20 </div>", "html5lib")
<html><head></head><body><div> &lt; 20 </div></body></html>
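
To illustrate how much the outcome depends on the parser, here is a minimal sketch using only the standard library's html.parser module rather than lxml or BeautifulSoup. The TextCollector class is a hypothetical helper written for this example, and the result may vary between Python versions, so treat it as an experiment to run rather than a guarantee:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every text chunk the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; a stray "<" may arrive as its own chunk.
        self.chunks.append(data)


collector = TextCollector()
collector.feed("<div> < 20 </div>")
collector.close()

# Join the chunks to see what text survived parsing.
text = "".join(collector.chunks)
print(repr(text))
```

If the printed text still contains the < character, this parser treated the stray bracket as ordinary text instead of dropping it.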

Volatility · Answer

Your < should actually be &lt;, since < is a reserved character in HTML. Once it is escaped, lxml should keep the text.
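
If you generate the markup yourself, one way to avoid the problem is to escape the text before embedding it. The sketch below uses the standard library's html.escape; wrap_in_div is a hypothetical helper name, and the approach assumes you control the text before it becomes HTML (it does not help with broken HTML you receive from elsewhere):

```python
import html


def wrap_in_div(text):
    """Hypothetical helper: escape the text, then wrap it in a <div>."""
    return "<div>{}</div>".format(html.escape(text))


markup = wrap_in_div(" < 20 ")
print(markup)  # <div> &lt; 20 </div>
```

Markup produced this way is well formed, so a parser such as lxml has no need to fall back on recovery for it.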

Tags: python, html-parsing, lxml

Viacheslav
