I have an old website originally created in MS FrontPage that I'm trying to de-FrontPage-ify. I've written a BeautifulSoup script that does most of it. The only thing left is to remove empty tables, i.e. tables with no text content or data in any of their td tags.
The problem I'm stuck on is that what I've tried so far removes a table if at least one of its td tags contains no data, even if others do. That removes every table in the document, including the ones with data I want to preserve.
tags = soup.findAll('table',text=None,recursive=True)
[tag.extract() for tag in tags]
Any suggestions on how to remove only the tables in which none of the td tags contain any data? (I don't care if they contain img or empty anchor tags, as long as there's no text.)
Use the .text property. It retrieves all text content (recursively) within that element, so it is empty only when none of the table's cells contain text.
Example:
from BeautifulSoup import BeautifulSoup as BS
html = """
<table id="empty">
<tr><td></td></tr>
</table>
<table id="with_text">
<tr><td>hey!</td></tr>
</table>
<table id="with_text_in_one_row">
<tr><td></td></tr>
<tr><td>hey!</td></tr>
</table>
<table id="no_text_but_img">
<tr><td><img></td></tr>
</table>
<table id="no_text_but_a">
<tr><td><a></a></td></tr>
</table>
<table id="text_in_a">
<tr><td><a>hey!</a></td></tr>
</table>
"""
soup = BS(html)
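# text=None and recursive=True are the defaults, so they are omitted here
for table in soup.findAll("table"):
    # .text is empty when no descendant of the table holds any text
    if table.text:
        print table["id"]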
Outputs:
with_text
with_text_in_one_row
text_in_a
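To actually remove the empty tables rather than just list the non-empty ones, the same check can be inverted. Below is a minimal sketch, assuming the newer bs4 package (BeautifulSoup 4) on Python 3 instead of the BeautifulSoup 3 import used above. The helper name remove_empty_tables and the shortened sample HTML are made up for illustration. In bs4, .text does not strip whitespace, so the sketch uses get_text(strip=True) to ignore the newlines and spaces between tags.

```python
from bs4 import BeautifulSoup

# Shortened sample document (illustrative only)
html = """
<table id="empty"><tr><td></td></tr></table>
<table id="with_text"><tr><td></td></tr><tr><td>hey!</td></tr></table>
<table id="no_text_but_img"><tr><td><img src="x.gif"></td></tr></table>
<table id="text_in_a"><tr><td><a href="#">hey!</a></td></tr></table>
"""


def remove_empty_tables(soup):
    """Detach every <table> that has no text in any descendant.

    Hypothetical helper; returns the ids of the tables it removed.
    """
    removed = []
    # find_all returns a list built up front, so detaching tags
    # inside the loop does not disturb the iteration.
    for table in soup.find_all("table"):
        # strip=True drops whitespace-only strings such as the
        # newlines between <table>, <tr> and <td> tags.
        if not table.get_text(strip=True):
            removed.append(table.get("id"))
            table.extract()
    return removed


soup = BeautifulSoup(html, "html.parser")
removed_ids = remove_empty_tables(soup)
remaining_ids = [t.get("id") for t in soup.find_all("table")]
print(removed_ids)
print(remaining_ids)
```

With this sample, the tables with ids empty and no_text_but_img are removed, while with_text and text_in_a are kept. extract() is used rather than decompose() so that nested tables inside an already removed table can still be visited safely later in the loop.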