Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BeautifulSoup: do not add spaces where they matter, remove them where they don't

This sample python program:

document='''<p>This is <i>something</i>, it happens
               in <b>real</b> life</p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(document)
print(soup.prettify())

produces the following output:

<html>
 <body>
  <p>
   This is
   <i>
    something
   </i>
   , it happens
               in
   <b>
    real
   </b>
   life
  </p>
 </body>
</html>

That's wrong, because it adds whitespace before and after each opening and closing tag and, for example, there should be no space between </i> and ,. I would like it to:

  1. Not add whitespace where there are none (even around block-level tags they could be problematic, if they are styled with display:inline in CSS.)

  2. Collapse all whitespace in a single space, except optionally for line wrapping.

Something like this:

<html>
 <body>
  <p>This is
   <i>something</i>,
   it happens in
   <b>real</b> life</p>
 </body>
</html>

Is this possible with BeautifulSoup? Any other recommended HTML parser that can deal with this?

like image 432
Jellby Avatar asked Aug 26 '14 20:08

Jellby


1 Answers

Because of the habit of .prettify to put each tag in it's own line, it is not suitable for production code; it is only usable for debugging output, IMO. Just convert your soup to a string, using the str builtin function.

What you want is a change of the string contents in your tree; you could create a function to find all elements which contain sequences of two or more whitespace characters (using a pre-compiled regular expression), and then replace their contents.

BTW, you can have Python avoid the insertion of insignificant whitespace if you write your example like so:

document = ('<p>This is <i>something</i>, it happens '
            'in <b>real</b> life</p>')

This way you have two literals which are implicitly concatinated.

like image 65
Tobias Avatar answered Sep 28 '22 19:09

Tobias