
Find all tables in html using BeautifulSoup

I want to find all tables in an HTML document using BeautifulSoup. Inner tables should be returned as part of their outer tables, not as separate results.

I have written some code that works and gives the expected output, but I don't like this solution because it uses .decompose(), which destroys the 'soup' object.

Do you know how to do it in a more elegant way?

from BeautifulSoup import BeautifulSoup as bs

input = '''<html><head><title>title</title></head>
<body>
<p>paragraph</p>
<div><div>
    <table>table1<table>inner11<table>inner12</table></table></table>
    <div><table>table2<table>inner2</table></table></div>
</div></div>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
</html>'''

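soup = bs(input)
# Repeatedly take the first remaining <table>, print it, then remove it
# (together with its nested tables) so the next find() moves on.
while True:
    t = soup.find("table")
    if t is None:
        break
    print str(t)
    t.decompose()  # destructive: the table is gone from soup afterwards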

Output:

<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table> 
Ivan Sas asked Mar 20 '12 08:03





1 Answer

Use soup.findAll("table") instead of find() and decompose(), and print only the tables that have no enclosing table:

tables = soup.findAll("table")

for table in tables:
    # Skip inner tables: they are already printed as part of their outer table.
    if table.findParent("table") is None:
        print str(table)

Output:

<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>

Nothing gets destroyed: the soup object is left intact.
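
For readers on Python 3, the same idea might look roughly like this with the newer bs4 package. This is only a sketch under a few assumptions: the helper name top_level_tables is made up for illustration, the HTML is a shortened stand-in for the question's input, and "html.parser" is chosen because it keeps the nested tables as written (other parsers such as lxml or html5lib may restructure tables nested this way).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's input (assumption, not the original).
html = """<body>
<div><table>table1<table>inner1</table></table></div>
<table>table2<table>inner2</table></table>
</body>"""

# html.parser is an assumption; other parsers may rearrange nested tables.
soup = BeautifulSoup(html, "html.parser")


def top_level_tables(soup):
    """Return the tables that are not nested inside another table.

    Hypothetical helper: find_all() collects every <table>, and
    find_parent("table") is None only for the outermost ones.
    """
    return [t for t in soup.find_all("table") if t.find_parent("table") is None]


for table in top_level_tables(soup):
    print(table)
```

Because nothing is removed from the tree, soup.find_all("table") still returns every table, inner ones included, after the loop has run.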

WeaselFox answered Oct 19 '22 12:10
