This sample Python program:
document='''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(document)
print(soup.prettify())
produces the following output:
<html>
<body>
<p>
This is
<i>
something
</i>
, it happens
in
<b>
real
</b>
life
</p>
</body>
</html>
That's wrong, because it adds whitespace before and after each opening and closing tag; for example, there should be no space between </i> and the comma that follows it. I would like it to:
Not add whitespace where there is none (even around block-level tags this could be problematic, if they are styled with display:inline in CSS).
Collapse all whitespace into a single space, except optionally for line wrapping.
Something like this:
<html>
<body>
<p>This is
<i>something</i>,
it happens in
<b>real</b> life</p>
</body>
</html>
Is this possible with BeautifulSoup? Is there any other recommended HTML parser that can deal with this?
Because .prettify puts each tag on its own line, it is not suitable for production code; it is only usable for debugging output, IMO. Just convert your soup to a string, using the str builtin function.
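As a minimal sketch of the difference, the snippet below serializes the same soup both ways. Passing 'html.parser' explicitly is an assumption made here so that the result does not depend on which parsers are installed; unlike the parser used in the question, it does not wrap the fragment in html and body tags.

```python
from bs4 import BeautifulSoup

document = '''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''

# Assumption: the stdlib-backed 'html.parser' is chosen explicitly here.
soup = BeautifulSoup(document, 'html.parser')

# str() serializes the tree without inserting any extra whitespace.
serialized = str(soup)

# prettify() puts each tag and each text node on its own line.
pretty = soup.prettify()

print(serialized)
```

With str, the comma stays directly attached to the closing i tag; the line break that was already in the source text is still there, because str does not change the text nodes either.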
What you want is a change of the string contents in your tree; you could create a function to find all elements which contain sequences of two or more whitespace characters (using a pre-compiled regular expression), and then replace their contents.
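A rough sketch of such a function follows. The name collapse_whitespace and the list of skipped tags are made up for illustration, and 'html.parser' is again an assumption. The pattern here matches any run of whitespace rather than only runs of two or more, so that a lone newline is also turned into a space.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Pre-compiled pattern: any run of one or more whitespace characters.
WHITESPACE_RUN = re.compile(r'\s+')


def collapse_whitespace(soup):
    """Collapse each whitespace run in plain text nodes to one space, in place."""
    # find_all returns a list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=WHITESPACE_RUN):
        # Leave comments, CDATA, script and style contents untouched.
        if type(node) is not NavigableString:
            continue
        # Whitespace is significant inside these elements.
        if node.find_parent(['pre', 'textarea']) is not None:
            continue
        collapsed = WHITESPACE_RUN.sub(' ', node)
        if collapsed != node:
            node.replace_with(collapsed)
    return soup


document = '''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''

soup = BeautifulSoup(document, 'html.parser')
collapse_whitespace(soup)
result = str(soup)
print(result)
# → <p>This is <i>something</i>, it happens in <b>real</b> life</p>
```

This only normalizes whitespace that is already in the text; it never adds any around tags. Line wrapping, if wanted, would have to be done as a separate step on the resulting string.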
BTW, you can have Python avoid the insertion of insignificant whitespace if you write your example like so:
document = ('<p>This is <i>something</i>, it happens '
            'in <b>real</b> life</p>')
This way you have two string literals which are implicitly concatenated.