How can I find a table after a text string using BeautifulSoup in Python?

Tags:

python

beautifulsoup

web-scraping

I am trying to extract data from several web pages which are not uniform in how they display their tables. I need to write code that will search for a text string and then go to the table immediately following that specific text string. Then I want to extract the contents of that table. Here's what I've got so far:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space
foundtext = soup.findAll('p',text=searchtext)
soupafter = foundtext.findAllNext()
table = soupafter.find('table') # find the next table after the search string is found
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

However, I get the following error:

    soupafter = foundtext.findAllNext()
AttributeError: 'ResultSet' object has no attribute 'findAllNext'

Is there an easy way to do this using BeautifulSoup?

asked Apr 19 '11 04:04

Josh Lee

1 Answer

The error occurs because findAllNext is a method of Tag objects, but foundtext is a ResultSet object, which is a list of matching tags or strings. You could iterate through each of the items in foundtext, but depending on your needs it might be sufficient to use find, which returns only the first match.

Here's a modified version of your code. After changing foundtext to use soup.find, I found and fixed the same problem with table. I modified your regex to ignore whitespace between the words:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE)
foundtext = soup.find('p',text=searchtext) # Find the first match for the search text (BeautifulSoup 3 returns the matching string when text= is given)
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

This outputs:

1. row 1, cell 1| 1. row 1, cell 2|
1. row 2, cell 1| 1. row 2, cell 2|
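
If you need every matching heading rather than just the first, the "iterate through the ResultSet" option mentioned above could look roughly like the sketch below. It is not the code from the question: it assumes Python 3 and the newer bs4 package (BeautifulSoup 4) instead of the BeautifulSoup 3 import used above, and the helper name tables_after and the broader pattern Table\s*\d+ are made up for illustration.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (bs4) is installed

HTML = (
    '<html><body>'
    '<p align="center"><b><font size="2">Table 1</font></b>'
    '<table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr>'
    '<tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table>'
    '<p align="center"><b><font size="2">Table 2</font></b>'
    '<table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr>'
    '<tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table>'
    '</html>'
)


def tables_after(soup, pattern):
    """Hypothetical helper: for each string matching `pattern`,
    yield the cell text of the first <table> that follows it."""
    for node in soup.find_all(string=pattern):
        table = node.find_next('table')
        if table is None:
            continue  # a heading with no table after it
        yield [
            [td.get_text(strip=True) for td in tr.find_all('td')]
            for tr in table.find_all('tr')
        ]


soup = BeautifulSoup(HTML, 'html.parser')
# \s* tolerates zero or more whitespace characters between the words
pattern = re.compile(r'Table\s*\d+', re.IGNORECASE)
results = list(tables_after(soup, pattern))

for rows in results:
    for row in rows:
        print('|'.join(row))
    print()
```

Each item in find_all's result is handled on its own, so a page with several headings yields one list of rows per heading, and headings with no following table are skipped instead of raising an AttributeError.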

answered Oct 21 '22 02:10

Josh Rosen
