How to remove insignificant whitespace in lxml.html?

Question

I'm rather surprised that lxml.html leaves insignificant whitespace when parsing HTML by default. I'm also surprised that I can't find any obvious way to make it not do that.

Python 2.7.3 (default, Apr 10 2013, 06:20:15) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree
>>> parser = lxml.etree.HTMLParser(remove_blank_text=True)
>>> html = lxml.etree.HTML("<p>      Hello     World     </p>", parser=parser)
>>> print lxml.etree.tostring(html)
<html><body><p>      Hello     World     </p></body></html>

I expect the result would be something like:

>>> print lxml.etree.tostring(html)
<html><body><p>Hello World</p></body></html>

BeautifulSoup4 does the same thing with the html5lib parser:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>      Hello     World     </p>", "html5lib")
>>> soup.p
<p>      Hello     World     </p>

After doing some research, I found that the HTML5 parsing specification does not call for collapsing consecutive whitespace; that is done at render time instead. So I understand that it's technically not the responsibility of any of these libraries to do so, but it seems useful enough that I'm surprised none of them offer it anyway.

Can somebody prove me wrong?

Edit:

I know how to remove whitespace using a regex — that was not my question. (I also know how to search SO for questions about regex.)

My question has to do with the insignificant whitespace, where significance is defined by the standards for rendering HTML. I doubt that a 1-liner regex can correctly implement this standard. And let's not even delve into the regex vs CFG debate again, please?

(See: RegEx match open tags except XHTML self-contained tags)

Edit 2:

In case it's not clear from the context, I am interested in HTML, not XHTML/XML. Whitespace does have some non-trivial rules of significance in HTML; however, those rules are implemented in the renderer, not the parser. I understand that, as evidenced in my initial post. My question is whether anybody has implemented the whitespace logic of an HTML renderer in a library that operates at the DOM level rather than at the rendering level.

Ivan Chaer · Accepted Answer

I came across the htmlmin library.

It can be installed with pip:

pip install htmlmin

It's used like this:

from htmlmin import minify

# Input with leading, trailing, and repeated spaces inside <p>
html = u"<html><body><p>      Hello     World     </p></body></html>"
minified_html = minify(html)
print(minified_html)

This prints:

<html><body><p> Hello World </p></body></html>

I thought it would do what you were looking for, but as you can see, the runs of spaces were collapsed to single spaces while the leading and trailing space inside <p> was kept.
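If you would rather stay at the DOM level, as the question asks, something along these lines may get you closer. It is only a rough sketch written for Python 3, not an implementation of the CSS white-space rules: `normalize_whitespace`, `PRESERVE`, and `BLOCK` are names I made up, the block-level tag list is hand-picked and incomplete, and it ignores CSS entirely (so `white-space: pre` set in a stylesheet is not honoured). It also does not merge spaces across inline boundaries, so `a <b> b</b>` keeps both spaces. It only touches `.tag`, `.text`, and `.tail`, so it should work on any ElementTree-style tree, including the one `lxml.html` gives you.

```python
import re

# Subtrees left untouched: whitespace is significant or the text is not prose.
PRESERVE = frozenset(["pre", "textarea", "script", "style"])

# Hand-picked, incomplete list of block-level tags (an assumption of this
# sketch; a real renderer derives this from CSS, not from tag names).
BLOCK = frozenset([
    "address", "article", "aside", "blockquote", "body", "dd", "div", "dl",
    "dt", "fieldset", "figure", "footer", "form", "h1", "h2", "h3", "h4",
    "h5", "h6", "header", "hr", "html", "li", "main", "nav", "ol", "p",
    "pre", "section", "table", "tbody", "td", "tfoot", "th", "thead", "tr",
    "ul",
])

_WS_RUN = re.compile(r"[ \t\n\r\f]+")


def _collapse(text):
    """Replace each run of HTML whitespace with a single space."""
    return _WS_RUN.sub(" ", text) if text else text


def _is_block(node):
    return node is not None and node.tag in BLOCK


def normalize_whitespace(el):
    """Collapse whitespace in el's subtree in place, trimming at block edges."""
    if not isinstance(el.tag, str) or el.tag in PRESERVE:
        return  # comment/PI or whitespace-sensitive subtree: leave as-is

    children = list(el)

    text = _collapse(el.text)
    if text and el.tag in BLOCK:
        text = text.lstrip(" ")  # start of a block
    if text and (_is_block(children[0]) if children else el.tag in BLOCK):
        text = text.rstrip(" ")  # just before a block child, or end of a block
    el.text = text or None

    for i, child in enumerate(children):
        normalize_whitespace(child)
        nxt = children[i + 1] if i + 1 < len(children) else None
        tail = _collapse(child.tail)
        if tail and _is_block(child):
            tail = tail.lstrip(" ")  # just after a block child
        if tail and (_is_block(nxt) if nxt is not None else el.tag in BLOCK):
            tail = tail.rstrip(" ")  # before a block sibling, or end of a block
        child.tail = tail or None


def demo():
    # lxml is third-party; imported here so the helper itself needs only `re`.
    import lxml.html

    root = lxml.html.fromstring("<p>      Hello     World     </p>")
    normalize_whitespace(root)
    print(lxml.html.tostring(root, encoding="unicode"))  # <p>Hello World</p>
```

Calling `demo()` runs it on the example from the question. Text and tails are handled in the parent's loop, so the whitespace after a `<pre>` is still cleaned up even though the contents of the `<pre>` are left alone.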

Tags: python, html-parsing, lxml.html

Mark E. Haase