Cannot chain find and find_all in BeautifulSoup

I have a book and docs on BeautifulSoup. Both say I should be able to chain find/find_all methods and use subscripts to get exactly what I want from a single page scrape. This does not appear to be the case. Consider the following table.

<tr>
<td><span style="display:none;" class="sortkey">Dresser !</span><span class="sorttext">**<a href="/wiki/Louise_Dresser" title="Louise Dresser">Louise Dresser</a>**</span></td>
<td><span style="display:none;" class="sortkey">Ship !</span><span class="sorttext"><i><a href="/wiki/A_Ship_Comes_In" title="A Ship Comes In">A Ship Comes In</a></i></span></td>
<td><span style="display:none;" class="sortkey">Pleznik !</span><span class="sorttext">Mrs. Pleznik</span></td>
</tr>
<tr>
<td><span style="display:none;" class="sortkey">Swanson !</span><span class="sorttext"><a href="/wiki/Gloria_Swanson" title="Gloria Swanson">Gloria Swanson</a></span></td>
<td><i><a href="/wiki/Sadie_Thompson" title="Sadie Thompson">Sadie Thompson</a></i></td>
<td><span style="display:none;" class="sortkey">Thompson !</span><span class="sorttext">Sadie Thompson</span></td>
</tr>
<tr>
<th scope="row" rowspan="6" style="text-align:center"><a href="/wiki/1928_in_film" title="1928 in film">1928</a>/<a href="/wiki/1929_in_film" title="1929 in film">29</a><br />
<small><a href="/wiki/2nd_Academy_Awards" title="2nd Academy Awards">(2nd)</a></small></th>
<td style="background:#FAEB86"><b><span style="display:none;" class="sortkey">Pickford !</span><span class="sorttext">**<a href="/wiki/Mary_Pickford" title="Mary Pickford">Mary Pickford</a>**</span> <img alt="Award winner" src="//upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" width="9" height="14" data-file-width="9" data-file-height="14" /></b></td>

For every table row, I need to grab the first cell (the first <td>), then the text inside the first <a> tag nested in it. Louise Dresser would be the first data point, followed by Gloria Swanson, and then Mary Pickford.

I thought the following would get me there, but I was wrong and 6 hours later I am spent.

def getActresses(URL):
    try:
        html = urlopen(URL)
    except HTTPError:
        print("Page not found.")
        return None
    try:
        bsObj = BeautifulSoup(html, "lxml")
        soup = bsObj.find("table", {"class":"wikitable sortable"})
    except AttributeError:
        print("Error creating/navigating soup object")
    data = soup.find_all("tr").find_all("td").find("a").get_text()
    print(data)


getActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress")

This isn't the only code I've tried. I've tried looping through rows, then table data cells, then accessing a tags. I've tried asking for a tags and then regexing them out, only to be told I couldn't have the text I wanted. The most frequent error I've gotten when trying to chain operations (as above) is AttributeError: 'ResultSet' object has no attribute 'find'. Subscripting absolutely doesn't work, even when replicating book examples (go fig?!). Also, I've had processes abort themselves, which I didn't know was possible.

Thoughts on what's going on and why something that should be so simple seems to be such an event would be enormously appreciated.

asked Jul 28 '17 03:07

Ryan

1 Answer

import requests
from bs4 import BeautifulSoup

def getActresses(URL):
    res = requests.get(URL)
    res.raise_for_status()  # stop early on a 4xx/5xx response

    soup = BeautifulSoup(res.content, "lxml")
    # find() returns None when nothing matches; it does not raise
    table = soup.find("table", {"class": "wikitable sortable"})
    if table is None:
        print("Error creating/navigating soup object")
        return

    # find_all() returns a ResultSet (a list of tags), so loop over it
    for tr in table.find_all("tr"):
        for td in tr.find_all("td"):
            for a in td.find_all("a"):
                print(a.text)  # already a str in Python 3, no encode needed

getActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress")

I used .text here instead of get_text(), though either one works. Sorry, I also used the requests module instead of urlopen to demonstrate.
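
For what it's worth, .text and get_text() should give the same string for a tag, so the choice between them is not what caused the error. A minimal check, assuming a made-up one-cell snippet modeled on the table in the question and the built-in html.parser rather than lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical one-cell snippet, modeled on the table in the question.
snippet = '<td><a href="/wiki/Gloria_Swanson">Gloria Swanson</a></td>'
cell = BeautifulSoup(snippet, "html.parser").find("td")

link = cell.find("a")                 # find() returns a single Tag
same = link.text == link.get_text()   # compare the two ways of reading text
print(link.text, same)
```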

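The find_all() method always returns a list (a ResultSet), so you have to loop through it. You cannot call find() or find_all() on the result directly, which is where the AttributeError comes from.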
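
To make the distinction concrete: find() returns a single Tag (or None), which has its own find/find_all, while find_all() returns a ResultSet, a list of Tags, which does not. The sketch below is only an illustration. It uses a trimmed-down, hypothetical copy of the rows from the question instead of fetching the live page, the built-in html.parser instead of lxml, and a made-up helper name, first_cell_names. It pulls the first link from the first cell of each row, which is what the question asks for.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the Wikipedia table in the question (assumption:
# the real page has the same tr > td > a nesting).
HTML = """
<table class="wikitable sortable">
<tr>
<td><span class="sorttext"><a href="/wiki/Louise_Dresser">Louise Dresser</a></span></td>
<td><i><a href="/wiki/A_Ship_Comes_In">A Ship Comes In</a></i></td>
</tr>
<tr>
<td><a href="/wiki/Gloria_Swanson">Gloria Swanson</a></td>
<td><i><a href="/wiki/Sadie_Thompson">Sadie Thompson</a></i></td>
</tr>
<tr>
<th scope="row"><a href="/wiki/1928_in_film">1928</a></th>
<td><b><a href="/wiki/Mary_Pickford">Mary Pickford</a></b></td>
</tr>
</table>
"""

def first_cell_names(table):
    """Return the text of the first <a> in the first <td> of each row."""
    names = []
    for row in table.find_all("tr"):      # ResultSet: iterate, don't chain
        cell = row.find("td")             # find() gives one Tag or None
        link = cell.find("a") if cell is not None else None
        if link is not None:              # skip rows without a linked cell
            names.append(link.get_text())
    return names

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

try:
    table.find_all("tr").find("td")       # the chain from the question
    chain_failed = False
except AttributeError:
    chain_failed = True

names = first_cell_names(table)
print(chain_failed, names)
```

Subscripting works on the same principle: table.find_all("tr")[0] is a single Tag, so table.find_all("tr")[0].find("td") is fine, whereas calling .find() on the un-indexed list is not.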

answered Oct 24 '22 20:10

Jeeva

Tags: python, beautifulsoup, web-scraping
