
How to remove tags that have no content

I am working with some old HTML that contains a lot of empty tags, such as <i style='mso-bidi-font-style:normal'><span style='font-size:11.0pt;font-family: Univers;mso-bidi-font-family:Arial'><o:p></o:p></span></i>. These break the algorithms I use to traverse the tree.

Is there a way to clean the BeautifulSoup object prior to traversing it?

from bs4 import BeautifulSoup

html_object = """
<i style='mso-bidi-font-style:normal'><span style='font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial'><o:p></o:p></span></i>
"""
soup = BeautifulSoup(html_object, "lxml")

Not even .prettify() is able to remove empty tags:

>>> print(soup.prettify())
<html>
 <body>
  <i style="mso-bidi-font-style:normal">
   <span style="font-size:11.0pt;font-family:
  Univers;mso-bidi-font-family:Arial">
    <o:p>
    </o:p>
   </span>
  </i>
 </body>
</html>

I would like the output of this call to be completely empty.

Asked by MERose, Nov 03 '15 13:11


3 Answers

The existing answers here have a slight problem: they all remove the <br> element, which is always empty but crucial to the structure of the HTML.

Keep all breaks

# Destroy every tag that has no children, except <br>
for x in soup.find_all(lambda tag: not tag.contents and tag.name != 'br'):
    x.decompose()

Source

<p><p></p><strong>some<br>text<br>here</strong></p>

Output

<p><strong>some<br>text<br>here</strong></p>

Also remove elements containing only whitespace

If you also want to remove tags that contain nothing but whitespace, you can do something like

# Destroy every tag without visible text, except <br>
for x in soup.find_all(lambda tag: tag.name != 'br' and not tag.get_text(strip=True)):
    x.decompose()

Source

<p><p>    </p><p></p><strong>some<br>text<br>here</strong></p>

Output

<p><strong>some<br>text<br>here</strong></p>
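
If you need this in more than one place, the same idea could be wrapped in a small helper. This is only a sketch: the name strip_blank_tags and the KEEP set are made up for illustration, and I use html.parser and extract() here rather than the calls above.

```python
from bs4 import BeautifulSoup

# Hypothetical set of always-empty tags to preserve; extend as needed.
KEEP = {"br", "hr", "img"}

def strip_blank_tags(soup, keep=KEEP):
    """Remove tags with no visible text, except those named in `keep`."""
    blank = soup.find_all(
        lambda tag: tag.name not in keep and not tag.get_text(strip=True)
    )
    for tag in blank:
        tag.extract()  # detaching an already-detached descendant is harmless
    return soup

soup = BeautifulSoup(
    "<p><p>    </p><p></p><strong>some<br>text<br>here</strong></p>",
    "html.parser",
)
print(strip_blank_tags(soup))
```

Note that a tag whose only child is a kept tag (say a <p> holding just a <br>) still counts as blank here and is removed together with that child.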

Answered by Sverrir Sigmundarson, Sep 18 '22 09:09


Here is a way to remove any tag whose .string is None:

>>> empty = soup.find_all(lambda tag: tag.string is None)
>>> for tag in empty:
...     tag.extract()
>>> print(soup.prettify())

The output is an empty string for your example, since no tag has any content. Note that .string is also None for a tag with more than one child, so this removes such tags even when they contain text.
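
To see which tags the .string test catches, a quick check like the following may help. It is only a sketch on a made-up snippet (the markup and the choice of html.parser are my assumptions, not from the question): .string is None for a tag with no children, but also for a tag with more than one child.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; html.parser does not add html/body wrappers.
soup = BeautifulSoup("<p>hi <b>there</b></p><i></i>", "html.parser")

no_children = soup.i.string   # <i> has no children at all
mixed = soup.p.string         # <p> has two children: a text node and <b>
single_text = soup.b.string   # <b> has exactly one text child

print(no_children, mixed, single_text)
```

So a filter on tag.string is None would pick up the <p> here as well, even though it clearly has text in it.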


If you only want to remove tags that have no children at all, such as <o:p></o:p>, and keep the tags that wrap them, there's another way:

>>> empty = soup.find_all(lambda tag: not tag.contents)
>>> for tag in empty:
...     tag.extract()
>>> print(soup.prettify())

Output:

<i style="mso-bidi-font-style:normal">
 <span style="font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial">
 </span>
</i>

The span and i tags are kept because each of them still contained a child tag when the search ran; their attributes play no part in the test. Only <o:p></o:p> had no children at that point.
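
If the wrappers should go as well once they become empty, one option is to repeat the pass until nothing childless is left. A rough sketch, assuming html.parser and shortened made-up style values; prune_childless is just a name I picked:

```python
from bs4 import BeautifulSoup

def prune_childless(soup):
    """Repeatedly remove tags that have no children until none are left."""
    while True:
        empty = soup.find_all(lambda tag: not tag.contents)
        if not empty:
            return soup
        for tag in empty:
            tag.extract()

# Shortened version of the markup from the question (style values are placeholders).
html = "<i style='a'><span style='b'><o:p></o:p></span></i>"
soup = BeautifulSoup(html, "html.parser")
prune_childless(soup)
print(repr(str(soup)))
```

Each pass removes the innermost empty tags, which leaves their parents empty for the next pass. Keep in mind that br and other void tags also have no children, so they would be removed too unless you exclude them in the lambda.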

Answered by Remi Crystal, Sep 19 '22 09:09


If your focus is on keeping just the textual elements, how about the following approach? It removes every element that contains no text, which would include images, so add any tags such as br or img that must not be removed to the exclusion list.

It really depends on what structure you want to remain.

from bs4 import BeautifulSoup

html_object = """
<i style='mso-bidi-font-style:normal'><span style='font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial'><o:p></o:p></span></i>
<i>hello world</i>
"""
soup = BeautifulSoup(html_object, "lxml")

for x in soup.find_all():
    if len(x.get_text(strip=True)) == 0 and x.name not in ['br', 'img']:
        x.extract()

print(soup)

Giving:

<html><body>
<i>hello world</i>
</body></html>
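
One case the loop above does not cover is a text-free wrapper around a tag you want to keep, for example a <p> that holds only an <img>: the wrapper has no text, so it is removed along with the image. A possible variation, sketched here with made-up names (KEEP, is_disposable, strip_textless) and a placeholder file name x.png, also spares any tag that contains a kept tag:

```python
from bs4 import BeautifulSoup

KEEP = ["br", "img"]  # hypothetical list of tags to preserve

def is_disposable(tag):
    """True if the tag has no text and neither is nor contains a kept tag."""
    if tag.name in KEEP or tag.find(KEEP) is not None:
        return False
    return not tag.get_text(strip=True)

def strip_textless(soup):
    # Collect all matches first, then detach them from the tree.
    for tag in soup.find_all(is_disposable):
        tag.extract()
    return soup

html = "<div><p><img src='x.png'></p><span> </span><i>hello world</i></div>"
soup = strip_textless(BeautifulSoup(html, "html.parser"))
print(soup)
```

Here the empty span goes away, while the p survives because of the img inside it. Whether that is what you want again depends on the structure you need to keep.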

Answered by Martin Evans, Sep 20 '22 09:09