BeautifulSoup scraping nested tables

Tags:

python

html-parsing

beautifulsoup

I have been trying to scrape data from a website that uses a large number of tables. I have read through the BeautifulSoup documentation and searched here on Stack Overflow, but I am still lost.

Here is the relevant HTML:

      <form action="/rr/" class="form">
        <table border="0" width="100%" cellpadding="2" cellspacing="0" align="left">
          <tr bgcolor="#6699CC">
            <td valign="top"><font face="arial"><b>Useless Data</b></font></td>
    
            <td width="10%"><br /></td>
    
            <td align="right"><font face="arial">Useless Data</font></td>
          </tr>
    
          <tr bgcolor="#DCDCDC">
            <td> <input size="12" name="s" value="data:" onfocus=
            "this.value = '';" /> <input type="hidden" name="d" value="research" />
    				
            <input type="submit" value="Date" /></td>
    
            <td width="10%"><br /></td>
    
          </tr>
        </table>
      </form>
    
      <table border="0" width="100%">
        <tr>
          <td></td>
        </tr>
      </table><br />
      <br />
    
      <table border="0" width="100%">
        <tr>
          <td valign="top" width="99%">
            <table cellpadding="2" cellspacing="0" border="0" width="100%">
              <tr bgcolor="#A0B8C8">
                <td colspan="6"><b>Data to be pulled</b></td>
              </tr>
    
              <tr bgcolor="#DCDCDC">
                <td><font face="arial"><b>Data to be pulled</b></font></td>
    
                <td><font face="arial"><b>Data to be pulled</b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
              </tr>
    
              <tr>
                <td>Data to be pulled</td>
    
                <td align="center">Data to be pulled</td>
    
                <td align="center">Data to be pulled</td>
    
                <td align="center">Data to be pulled</td>
    
                <td align="center"><br /></td>
              </tr>
    	    </table>
    	  </td>
    	</tr>
      </table>

There are quite a few tables, and none of them has a distinguishing id or other attribute. My most recent attempt was:

table = soup.find('table', attrs={'border': '0', 'width': '100%'})

This returns only the first, empty table. I feel like the answer is simple and I am overthinking it.

asked May 05 '15 21:05

kayduh

1 Answer

If you're just looking for all of the tables, rather than the first one, you just want find_all instead of find.
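A minimal sketch of the difference between the two calls. The html string is a cut-down, hypothetical stand-in for the page in the question, and the standard-library html.parser backend is an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down stand-in for the page in the question.
html = """
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%"><tr><td>
  <table border="0" width="100%"><tr><td>Data to be pulled</td></tr></table>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.find("table")       # only the first match (or None if no match)
tables = soup.find_all("table")  # every match, nested tables included

print(first.get_text(strip=True) == "")  # the first table is the empty one
print(len(tables))                       # top-level tables plus the nested one
```

Because find_all searches recursively by default, the nested table shows up in the result alongside the top-level ones.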

If you're trying to find a particular table, like the one nested inside another one, and the page is using a 90s-style design that makes it impossible to find it via id or other attrs, the only option is to search by structure:

for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        # Found it! subtable is a table nested inside another table.
        print(subtable)

And of course you can flatten this into a single comprehension if you really want to:

subtable = next(subtable for table in soup.find_all('table') 
                for subtable in table.find_all('table'))
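As a rough sketch of what you might do once the nested table is found, the following pulls its cell text into a list of rows. The sample html is a hypothetical, shortened version of the markup in the question, and passing None as the default to next() is an addition so that a page without a nested table does not raise StopIteration:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the markup in the question.
html = """
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%">
  <tr><td valign="top" width="99%">
    <table cellpadding="2" cellspacing="0" border="0" width="100%">
      <tr bgcolor="#A0B8C8"><td colspan="6"><b>Title</b></td></tr>
      <tr><td>a</td><td align="center">b</td><td align="center"><br /></td></tr>
    </table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# First table that sits inside another table; None if there is none.
subtable = next(
    (sub for table in soup.find_all("table") for sub in table.find_all("table")),
    None,
)

rows = []
if subtable is not None:
    for tr in subtable.find_all("tr"):
        # One list of cell strings per row; empty cells become "".
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```

This is just as structure-dependent as the loops above: if the page gains another level of nesting, the row extraction would need to change with it.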

Notice that I left off the attrs. If every table on the page carries those same attrs (possibly along with others), specifying them filters nothing out.
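To illustrate the point about attrs, here is a small check on a hypothetical stand-in for the page, where every table has border="0" and width="100%" just as in the question. The filtered and unfiltered searches return the same tags:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in: every table carries border="0" and width="100%".
html = """
<table border="0" width="100%" cellpadding="2" align="left"><tr><td>form</td></tr></table>
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%"><tr><td>
  <table cellpadding="2" cellspacing="0" border="0" width="100%"><tr><td>data</td></tr></table>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

with_attrs = soup.find_all("table", attrs={"border": "0", "width": "100%"})
without_attrs = soup.find_all("table")

# Same number of results, and the very same Tag objects in the same order.
same = len(with_attrs) == len(without_attrs) and all(
    a is b for a, b in zip(with_attrs, without_attrs)
)
print(len(with_attrs), len(without_attrs), same)
```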

This whole thing is obviously ugly and brittle… but there's really no way not to be brittle with this kind of layout.

Using a different library, like lxml.html, that lets you search by XPath might make it a little more compact, but it's ultimately going to be doing the same thing.
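For comparison, a sketch of the XPath route with lxml.html, assuming lxml is installed. The html string is again a hypothetical stand-in for the real page, and //table//table is simply the XPath spelling of "a table somewhere inside another table":

```python
import lxml.html

# Hypothetical stand-in for the page in the question.
html = """
<html><body>
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%"><tr><td>
  <table border="0" width="100%"><tr><td>a</td><td>b</td></tr></table>
</td></tr></table>
</body></html>
"""

doc = lxml.html.fromstring(html)

# Every table that has another table as an ancestor.
nested = doc.xpath("//table//table")

rows = [
    [td.text_content().strip() for td in tr.xpath("./td")]
    for tr in nested[0].xpath(".//tr")
]
print(len(nested), rows)
```

The query is shorter, but it encodes the same structural assumption as the nested find_all loops.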

answered Oct 21 '22 06:10

abarnert
