I want to find all tables in html using BeautifulSoup. Inner tables should be included in outer tables.
I have created some code which works and it gives expected output. But, I don't like this solution, because it uses .decompose()
which destroys the'soup' object.
Do you know how to do it in more elegant way?
from BeautifulSoup import BeautifulSoup as bs
input = '''<html><head><title>title</title></head>
<body>
<p>paragraph</p>
<div><div>
<table>table1<table>inner11<table>inner12</table></table></table>
<div><table>table2<table>inner2</table></table></div>
</div></div>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
</html>'''
soup = bs(input)
while(True):
t=soup.find("table")
if t is None:
break
print str(t)
t.decompose()
Output:
<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
For this, you can use different python libraries that help you extract content from the HTML table. One such method is available in the popular python Pandas library, it is called read_html(). The method accepts numerous arguments that allow you to customize how the table will be parsed.
find() method The find method is used for finding out the first tag with the specified name or id and returning an object of type bs4. Example: For instance, consider this simple HTML webpage having different paragraph tags.
use soup.findAll("table")
instead of find()
and decompose()
:
tables = soup.findAll("table")
for table in tables:
if table.findParent("table") is None:
print str(table)
output :
<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
and nothing gets destroyed/destructed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With