BeautifulSoup scraping nested tables

I have been trying to scrape data from a website that uses a large number of tables. I have read through the BeautifulSoup documentation and searched Stack Overflow, but I am still lost.

Here is the HTML in question:

      <form action="/rr/" class="form">
        <table border="0" width="100%" cellpadding="2" cellspacing="0" align="left">
          <tr bgcolor="#6699CC">
            <td valign="top"><font face="arial"><b>Useless Data</b></font></td>
    
            <td width="10%"><br /></td>
    
            <td align="right"><font face="arial">Useless Data</font></td>
          </tr>
    
          <tr bgcolor="#DCDCDC">
            <td> <input size="12" name="s" value="data:" onfocus=
            "this.value = '';" /> <input type="hidden" name="d" value="research" />
    				
            <input type="submit" value="Date" /></td>
    
            <td width="10%"><br /></td>
    
          </tr>
        </table>
      </form>
    
      <table border="0" width="100%">
        <tr>
          <td></td>
        </tr>
      </table><br />
      <br />
    
      <table border="0" width="100%">
        <tr>
          <td valign="top" width="99%">
            <table cellpadding="2" cellspacing="0" border="0" width="100%">
              <tr bgcolor="#A0B8C8">
                <td colspan="6"><b>Data to be pulled</b></td>
              </tr>
    
              <tr bgcolor="#DCDCDC">
                <td><font face="arial"><b>Data to be pulled</b></font></td>
    
                <td><font face="arial"><b>Data to be pulled</b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
              </tr>
    
              <tr>
                <td>Data to be pulled</td>
    
                <td align="center">Data to be pulled</td>
    
                <td align="center">Data to be pulled</td>
    
                <td align="center">Data to be pulled</td>
    
                <td align="center"><br /></td>
              </tr>
    	    </table>
    	  </td>
    	</tr>
      </table>

There are quite a few tables, and none of them has any distinguishing ids or other identifying attributes. My most recent attempt was:

table = soup.find('table', attrs={'border': '0', 'width': '100%'})

This returns only the first, empty table. I feel like the answer is simple and I am overthinking it.

kayduh asked May 05 '15 21:05



1 Answer

If you're just looking for all of the tables, rather than the first one, you just want find_all instead of find.
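To illustrate the difference, here is a minimal sketch. The HTML string is a made-up, trimmed-down stand-in for the page in the question, and the variable names are only illustrative. It assumes the `bs4` package is installed and uses the standard-library `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the page: one empty table,
# then an outer table that wraps the table holding the data.
html = """
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%"><tr><td>
  <table cellpadding="2" border="0" width="100%">
    <tr><td>Data to be pulled</td></tr>
  </table>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
attrs = {"border": "0", "width": "100%"}

# find() stops at the first match, which here is the empty table.
first = soup.find("table", attrs=attrs)

# find_all() returns every match, including the nested table.
all_tables = soup.find_all("table", attrs=attrs)

first_text = first.get_text(strip=True)
table_count = len(all_tables)
print(repr(first_text), table_count)
```

Because `find_all` searches recursively by default, the nested table is included in the result alongside the two top-level ones.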

If you're trying to find a particular table, like the one nested inside another one, and the page is using a 90s-style design that makes it impossible to find it via id or other attrs, the only option is to search by structure:

for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        # Found it! subtable is a table nested inside another table.
        print(subtable)

And of course you can flatten this into a single comprehension if you really want to:

subtable = next(subtable for table in soup.find_all('table') 
                for subtable in table.find_all('table'))

Notice that I left off the attrs. If every table on the page has a superset of the same attrs, you aren't helping anything by specifying them.
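Once the nested table is found, its cell text can be collected row by row. The following is a rough sketch under assumptions: the HTML and the cell values (`Title`, `A`, `B`) are invented for illustration, and `next()` is given a default of `None` so that a page without a nested table does not raise `StopIteration`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the layout in the question.
html = """
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%">
  <tr>
    <td valign="top" width="99%">
      <table cellpadding="2" cellspacing="0" border="0" width="100%">
        <tr bgcolor="#A0B8C8"><td colspan="6"><b>Title</b></td></tr>
        <tr><td>A</td><td align="center">B</td><td align="center"><br /></td></tr>
      </table>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# First table that sits inside another table, or None if there is none.
subtable = next(
    (sub for table in soup.find_all("table") for sub in table.find_all("table")),
    None,
)

# One list of cell strings per row; empty cells become empty strings.
rows = []
if subtable is not None:
    for tr in subtable.find_all("tr"):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```

If the real page nests tables more than one level deep, the first match may not be the one you want, so it is worth printing the candidates before relying on this.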

This whole thing is obviously ugly and brittle… but there's really no way not to be brittle with this kind of layout.

Using a different library, like lxml.html, that lets you search by XPath might make it a little more compact, but it's ultimately going to be doing the same thing.
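For comparison, a sketch of the XPath approach with `lxml.html` might look like the following. It assumes the `lxml` package is installed; the HTML and cell values are again invented for illustration. The expression `//table//table` selects any table that is a descendant of another table, which is the same structural search as the nested loops.

```python
import lxml.html

# Hypothetical sample mirroring the layout in the question.
html = """
<html><body>
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%">
  <tr>
    <td valign="top" width="99%">
      <table cellpadding="2" cellspacing="0" border="0" width="100%">
        <tr><td colspan="6"><b>Title</b></td></tr>
        <tr><td>A</td><td align="center">B</td><td align="center"><br /></td></tr>
      </table>
    </td>
  </tr>
</table>
</body></html>
"""

doc = lxml.html.fromstring(html)

# Every table that has another table as an ancestor.
nested = doc.xpath("//table//table")

# Text of each cell in the first nested table, in document order.
cells = [td.text_content().strip() for td in nested[0].xpath(".//td")]

print(len(nested), cells)
```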

abarnert answered Oct 21 '22 06:10