How to remove tags that have no content

Question

I am working on some old HTML that features a lot of empty tags such as <o:p></o:p>. These seriously break the algorithms I use to traverse the tree.

Is there a way to clean the BeautifulSoup object prior to traversing it?

from bs4 import BeautifulSoup

html_object = """
<i style='mso-bidi-font-style:normal'><span style='font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial'><o:p></o:p></span></i>
"""
soup = BeautifulSoup(html_object, "lxml")

Not even .prettify() is able to remove empty tags:

>>> print(soup.prettify())
<html>
 <body>
  <i style="mso-bidi-font-style:normal">
   <span style="font-size:11.0pt;font-family:
  Univers;mso-bidi-font-family:Arial">
    <o:p>
    </o:p>
   </span>
  </i>
 </body>
</html>

I would like to see the output of this call completely empty.

Sverrir Sigmundarson · Accepted Answer

The existing answers here have a slight problem: they all remove the <br> element, which is always empty but crucial for the structure of the HTML.

Keep all breaks

for x in soup.find_all(lambda tag: not tag.contents and tag.name != 'br'):
    x.decompose()

Source

<p><p></p><strong>some<br>text<br>here</strong></p>

Output

<p><strong>some<br>text<br>here</strong></p>

Also remove elements that contain only whitespace

If you also want to remove tags that contain only whitespace, you can do something like

for x in soup.find_all(lambda tag: (not tag.contents or not tag.get_text(strip=True)) and tag.name != 'br'):
    x.decompose()

Source

<p><p>    </p><p></p><strong>some<br>text<br>here</strong></p>

Output

<p><strong>some<br>text<br>here</strong></p>
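
The same idea can be wrapped in a small helper that protects more than just br. The following is only a sketch: the names VOID_TAGS, is_removable and drop_empty_tags are made up for illustration, the list of void elements is an assumption you should adjust, and "html.parser" is used instead of lxml so that no html/body wrapper is added.

```python
from bs4 import BeautifulSoup

# Elements that are empty by definition; this set is an illustrative assumption.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

def is_removable(tag, strip_whitespace=True):
    """Return True for tags with no children or, optionally, only whitespace."""
    if tag.name in VOID_TAGS:
        return False
    if not tag.contents:
        return True
    return strip_whitespace and not tag.get_text(strip=True)

def drop_empty_tags(soup, strip_whitespace=True):
    # find_all() returns a list, so the tree can be modified while looping over it.
    for tag in soup.find_all(lambda t: is_removable(t, strip_whitespace)):
        if tag.decomposed:  # already destroyed together with a removed parent
            continue
        tag.decompose()
    return soup

html = "<div><p>    </p><p></p><strong>some<br>text<br>here</strong><hr></div>"
soup = drop_empty_tags(BeautifulSoup(html, "html.parser"))
print(soup)  # <div><strong>some<br/>text<br/>here</strong><hr/></div>
```

Note that the whitespace check looks only at text, so a tag whose only content is an image or a break would still be removed as a whole.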

Remi Crystal · Answer

Here is a way to remove any tag which has no content:

>>> html = soup.findAll(lambda tag: tag.string is None)
>>> [tag.extract() for tag in html]
>>> print(soup.prettify())

The output is an empty string for your example, since no tag in it has any content.
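
One thing to keep in mind with this filter: .string is None not only for empty tags but for any tag with more than one child, so a paragraph with mixed text and markup would be removed as well. A small illustration (the sample markup is invented, and "html.parser" is assumed instead of lxml):

```python
from bs4 import BeautifulSoup

sample = "<p>some <b>bold</b> text</p><em>plain</em><u></u>"
soup = BeautifulSoup(sample, "html.parser")

# Map each tag name to its .string value.
strings = {tag.name: tag.string for tag in soup.find_all(True)}

for name, value in strings.items():
    print(name, repr(value))
# p None      (three children, although it clearly has text)
# b 'bold'
# em 'plain'
# u None      (no children at all)
```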

If you only want to remove tags that have no children at all, such as <o:p></o:p>, and keep the tags that wrap them, there is another way:

>>> html = soup.findAll(lambda tag: not tag.contents)
>>> [tag.extract() for tag in html]
>>> print(soup.prettify())

Output:

<i style="mso-bidi-font-style:normal">
 <span style="font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial">
 </span>
</i>

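The span and i tags are kept even though they end up with no content: when the search ran, each of them still contained a child tag, so tag.contents was not empty. Their attributes play no part in the check.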
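
If the goal is the completely empty output asked for in the question, the single pass can be repeated until no childless tag is left. This is a rough sketch, not part of the original answer: strip_empty_recursively is a hypothetical name, the style attributes are shortened, and "html.parser" is assumed so that no html/body wrapper appears.

```python
from bs4 import BeautifulSoup

html = (
    "<i style='mso-bidi-font-style:normal'>"
    "<span style='font-size:11.0pt'><o:p></o:p></span></i>"
)
soup = BeautifulSoup(html, "html.parser")

def strip_empty_recursively(soup):
    """Repeat the single-pass removal until no childless tag is left."""
    passes = 0
    while True:
        empty = soup.find_all(lambda tag: not tag.contents)
        if not empty:
            return passes
        for tag in empty:
            tag.extract()
        passes += 1

passes = strip_empty_recursively(soup)
print(passes)           # 3: first <o:p>, then <span>, then <i>
print(repr(str(soup)))  # ''
```

As written this also removes br and other void elements, so exclude them in the lambda if they matter.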

Martin Evans · Answer

If your focus is keeping just textual elements, how about the following approach? It removes every element that contains no text, which would include images, so list any tags that must not be removed, such as br or img, in the exclusion list.

It really depends on what structure you want to remain.

from bs4 import BeautifulSoup

html_object = """
<i style='mso-bidi-font-style:normal'><span style='font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial'><o:p></o:p></span></i>
<i>hello world</i>
"""
soup = BeautifulSoup(html_object, "lxml")

for x in soup.find_all():
    if len(x.get_text(strip=True)) == 0 and x.name not in ['br', 'img']:
        x.extract()

print(soup)

Giving:

<html><body>
<i>hello world</i>
</body></html>
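
One caveat: the exclusion list only protects the listed tag itself. A parent without text, such as a link wrapping an image, is still extracted together with the image. A possible variant that also keeps any tag containing a protected descendant is sketched below; KEEP and prune_textless are hypothetical names, the sample markup is invented, and "html.parser" is assumed instead of lxml.

```python
from bs4 import BeautifulSoup

KEEP = ["br", "img"]  # tags that must survive although they hold no text

def prune_textless(soup, keep=KEEP):
    """Extract tags without text unless they are, or contain, a kept tag."""
    for tag in soup.find_all():
        if tag.name in keep or tag.find(keep) is not None:
            continue
        if not tag.get_text(strip=True):
            tag.extract()
    return soup

html = (
    '<p><a href="#"><img src="logo.png"/></a></p>'
    "<p><span></span></p>"
    "<i>hello world</i>"
)
soup = prune_textless(BeautifulSoup(html, "html.parser"))
print(soup)
# <p><a href="#"><img src="logo.png"/></a></p><i>hello world</i>
```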

Tags: python, html, beautifulsoup

MERose
