Find all tables in HTML using BeautifulSoup

I want to find all tables in an HTML document using BeautifulSoup. Inner tables should be included in their outer tables, not reported separately.

I have written some code that works and gives the expected output, but I don't like this solution because it uses .decompose(), which destroys the soup object.

Do you know a more elegant way to do this?

# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup as bs

input = '''<html><head><title>title</title></head>
<body>
<p>paragraph</p>
<div><div>
    <table>table1<table>inner11<table>inner12</table></table></table>
    <div><table>table2<table>inner2</table></table></div>
</div></div>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
</html>'''

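soup = bs(input)
# Repeatedly take the first remaining <table>, print it, then remove it
# (together with its nested tables) so the next find() moves on.
while True:
    t = soup.find("table")
    if t is None:
        break
    print str(t)
    t.decompose()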

Output:

<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>

asked Mar 20 '12 08:03

Ivan Sas

1 Answer

Use soup.findAll("table") instead of find() and decompose(), and keep only the tables that have no enclosing table:

tables = soup.findAll("table")

for table in tables:
    # A table with no <table> ancestor is an outermost table.
    if table.findParent("table") is None:
        print str(table)

Output:

<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>

Nothing is removed from the soup object, so it stays intact.
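For readers on Python 3 with BeautifulSoup 4, the same idea can be sketched roughly as below. This is not part of the original answer. It assumes the bs4 package is installed and uses its snake_case method names find_all and find_parent. It picks the built-in "html.parser" because that parser keeps the directly nested tables as written; other parsers may repair this markup differently. The helper name outermost_tables is hypothetical.

```python
from bs4 import BeautifulSoup

# Same sample markup as in the question.
HTML = '''<html><head><title>title</title></head>
<body>
<p>paragraph</p>
<div><div>
    <table>table1<table>inner11<table>inner12</table></table></table>
    <div><table>table2<table>inner2</table></table></div>
</div></div>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
</html>'''


def outermost_tables(soup):
    """Return the <table> tags that have no <table> ancestor (hypothetical helper)."""
    return [t for t in soup.find_all("table") if t.find_parent("table") is None]


# Assumption: "html.parser" is used so the nesting is kept as written.
soup = BeautifulSoup(HTML, "html.parser")
tables = outermost_tables(soup)

for table in tables:
    # Each outermost table is printed with its inner tables included.
    print(table)
```

Because the filter only reads the tree, soup can still be queried afterwards, which is the property the question was after.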

answered Oct 19 '22 12:10

WeaselFox
