I am working on some old HTML that features a lot of empty tags, such as <i style='mso-bidi-font-style:normal'><span style='font-size:11.0pt;font-family: Univers;mso-bidi-font-family:Arial'><o:p></o:p></span></i>. This seriously breaks the algorithms I use to traverse the tree.
Is there a way to clean the BeautifulSoup object prior to traversing it?
from bs4 import BeautifulSoup
html_object = """
<i style='mso-bidi-font-style:normal'><span style='font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial'><o:p></o:p></span></i>
"""
soup = BeautifulSoup(html_object, "lxml")
Not even .prettify() is able to remove empty tags:
>>> print(soup.prettify())
<html>
<body>
<i style="mso-bidi-font-style:normal">
<span style="font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial">
<o:p>
</o:p>
</span>
</i>
</body>
</html>
I would like to see the output of this call completely empty.
The existing answers here share a slight problem: they all remove the <br> element, which is always empty but crucial for the structure of the HTML.
Keep all breaks
for x in soup.find_all(lambda tag: not tag.contents and tag.name != 'br'):
    x.decompose()
Source
<p><p></p><strong>some<br>text<br>here</strong></p>
Output
<p><strong>some<br>text<br>here</strong></p>
Also remove elements that contain only whitespace
If you also want to remove tags that contain only whitespace, you can do something like:
for x in soup.find_all(lambda tag: (not tag.contents or not tag.get_text(strip=True)) and tag.name != 'br'):
    x.decompose()
Source
<p><p> </p><p></p><strong>some<br>text<br>here</strong></p>
Output
<p><strong>some<br>text<br>here</strong></p>
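The two filters above can be folded into one helper. This is a minimal sketch, not part of the original answer: `remove_empty_tags` and its `keep` parameter are hypothetical names, and it uses "html.parser" so that no extra parser is needed. It walks the tags in reverse document order, so a parent that only becomes empty once its children are gone is removed in the same pass, and it leaves alone any tag that still wraps a kept element such as br or img.

```python
from bs4 import BeautifulSoup


def remove_empty_tags(soup, keep=("br", "img")):
    """Remove tags with no text and no kept descendants; return the number removed.

    `keep` is a hypothetical parameter listing tag names that must survive
    even though they never contain text.
    """
    keep = list(keep)
    removed = 0
    # Reverse document order visits descendants before their ancestors,
    # so a parent emptied by an earlier removal is caught in the same pass.
    for tag in reversed(soup.find_all(True)):
        if tag.name in keep:
            continue
        has_text = bool(tag.get_text(strip=True))
        has_kept_descendant = tag.find(keep) is not None
        if not has_text and not has_kept_descendant:
            tag.decompose()
            removed += 1
    return removed


html = "<div><p> </p><p></p><strong>some<br>text</strong><i><span><b></b></span></i></div>"
soup = BeautifulSoup(html, "html.parser")
count = remove_empty_tags(soup)
print(count)  # 5
print(soup)   # <div><strong>some<br/>text</strong></div>
```

Whether a tag that holds nothing but an image should count as empty depends on the document, so adjust `keep` to taste.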
Here is a way to remove any tag which has no content:
>>> html = soup.findAll(lambda tag: tag.string is None)
>>> [tag.extract() for tag in html]
>>> print(soup.prettify())
The output is an empty string for your example, since no tag in it has any content.
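One caveat with this filter: .string is also None for any tag that has more than one child, so on other documents it can match tags that do contain text. The short check below is only an illustration on made-up markup, parsed with "html.parser":

```python
from bs4 import BeautifulSoup

# <p> has three children (text, <b>, text), so its .string is None
# even though it clearly contains text; <i> is genuinely empty.
soup = BeautifulSoup("<p>some <b>bold</b> text</p><i></i>", "html.parser")
matches = [tag.name for tag in soup.find_all(lambda tag: tag.string is None)]
print(matches)  # ['p', 'i']
```

Extracting every match here would throw away the whole paragraph, not just the empty <i>.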
If you only want to remove tags that have no children at all, such as <o:p></o:p>, there's another way:
>>> html = soup.findAll(lambda tag: not tag.contents)
>>> [tag.extract() for tag in html]
>>> print(soup.prettify())
Output:
<i style="mso-bidi-font-style:normal">
<span style="font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial">
</span>
</i>
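The span and i tags are kept because they were not empty when the search ran: each still contained a child tag. They only become empty once <o:p> has been extracted, and the search is not repeated.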
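If the wrappers left behind should go as well, one option is to repeat the pass until nothing matches any more. A minimal sketch, assuming "html.parser" and a hypothetical helper name `strip_childless`; note that this filter would also remove <br>, since void elements have no contents either.

```python
from bs4 import BeautifulSoup


def strip_childless(soup):
    """Repeatedly extract tags without children; return the number of passes made."""
    passes = 0
    while True:
        # The result list is built before anything is extracted, so parents
        # emptied during this pass are only found by the next pass.
        empty = soup.find_all(lambda tag: not tag.contents)
        if not empty:
            return passes
        for tag in empty:
            tag.extract()
        passes += 1


html = "<i style='mso-bidi-font-style:normal'><span style='font-size:11.0pt'><o:p></o:p></span></i>"
soup = BeautifulSoup(html, "html.parser")
passes = strip_childless(soup)
print(passes)     # 3: one pass each for o:p, span and i
print(repr(str(soup)))
```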
If your focus is on keeping just the textual elements, how about the following approach? It removes every element that contains no text, which would include images, so add any tags such as br or img that must not be removed to the exclusion list.
It really depends on what structure you want to remain.
from bs4 import BeautifulSoup
html_object = """
<i style='mso-bidi-font-style:normal'><span style='font-size:11.0pt;font-family:
Univers;mso-bidi-font-family:Arial'><o:p></o:p></span></i>
<i>hello world</i>
"""
soup = BeautifulSoup(html_object, "lxml")
for x in soup.find_all():
    # Drop any tag without text, except the names listed here
    if len(x.get_text(strip=True)) == 0 and x.name not in ['br', 'img']:
        x.extract()
print(soup)
Giving:
<html><body>
<i>hello world</i>
</body></html>