
BeautifulSoup: How to remove empty tables, while preserving tables that are partially empty or not empty

I have an old website originally created in MS Frontpage that I'm trying to defrontpagify. I've written a BeautifulSoup script that does most of it. The only thing left is to remove empty tables, i.e. tables with no text content or data in any of their td tags.

The problem I'm stuck on is that what I've tried so far removes a table if at least one of its td tags contains no data, even if others do. That removes all the tables in the entire document, including ones with data I want to preserve.

tags = soup.findAll('table',text=None,recursive=True) 
[tag.extract() for tag in tags]

Any suggestions on how to remove only the tables in which none of the td tags contain any data? (I don't care if they contain img or empty anchor tags, as long as there's no text.)

Kurtosis asked Oct 09 '22 20:10


1 Answer

Use the .text property. It returns all the text contained in that element and, recursively, in its descendants, so a table whose .text is empty has no text in any of its td tags.

Example:

from BeautifulSoup import BeautifulSoup as BS

html = """
<table id="empty">
  <tr><td></td></tr>
</table>

<table id="with_text">
  <tr><td>hey!</td></tr>
</table>

<table id="with_text_in_one_row">
  <tr><td></td></tr>
  <tr><td>hey!</td></tr>
</table>

<table id="no_text_but_img">
  <tr><td><img></td></tr>
</table>

<table id="no_text_but_a">
  <tr><td><a></a></td></tr>
</table>

<table id="text_in_a">
  <tr><td><a>hey!</a></td></tr>
</table>

"""

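soup = BS(html)
# text=None and recursive=True are the defaults, so findAll("table") is enough
for table in soup.findAll("table"):
    # .text is empty when neither the table nor any descendant holds text
    if table.text:
        print(table["id"])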

Outputs:

with_text
with_text_in_one_row
text_in_a
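
The example above only prints the tables that have text. For the removal itself, a minimal sketch with the newer bs4 package on Python 3 might look like the following. It assumes bs4 is installed, and remove_empty_tables is a hypothetical helper name, not part of BeautifulSoup. In bs4, .text keeps the whitespace between tags, so the check here uses get_text(strip=True) instead.

```python
from bs4 import BeautifulSoup

html = """
<table id="empty"><tr><td></td></tr></table>
<table id="with_text_in_one_row">
  <tr><td></td></tr>
  <tr><td>hey!</td></tr>
</table>
<table id="no_text_but_img"><tr><td><img></td></tr></table>
<table id="text_in_a"><tr><td><a>hey!</a></td></tr></table>
"""


def remove_empty_tables(soup):
    """Detach every <table> that contains no non-whitespace text.

    Hypothetical helper for illustration; returns the ids of removed tables.
    """
    removed = []
    # find_all returns a list, so detaching tables while looping is safe
    for table in soup.find_all("table"):
        # get_text(strip=True) strips each string and drops the empty ones
        if not table.get_text(strip=True):
            removed.append(table.get("id"))
            table.extract()  # same call the question used, applied per table
    return removed


soup = BeautifulSoup(html, "html.parser")
removed_ids = remove_empty_tables(soup)
remaining_ids = [t.get("id") for t in soup.find_all("table")]
print(removed_ids)
print(remaining_ids)
```

On this sample, the tables with ids empty and no_text_but_img should be removed, while with_text_in_one_row and text_in_a stay in the document.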

Avaris answered Oct 12 '22 11:10