I have a book and docs on BeautifulSoup. Both say I should be able to chain find/find_all methods and use subscripts to get exactly what I want from a single page scrape. This does not appear to be the case. Consider the following table.
<tr>
<td><span style="display:none;" class="sortkey">Dresser !</span><span class="sorttext">**<a href="/wiki/Louise_Dresser" title="Louise Dresser">Louise Dresser</a>**</span></td>
<td><span style="display:none;" class="sortkey">Ship !</span><span class="sorttext"><i><a href="/wiki/A_Ship_Comes_In" title="A Ship Comes In">A Ship Comes In</a></i></span></td>
<td><span style="display:none;" class="sortkey">Pleznik !</span><span class="sorttext">Mrs. Pleznik</span></td>
</tr>
<tr>
<td><span style="display:none;" class="sortkey">Swanson !</span><span class="sorttext"><a href="/wiki/Gloria_Swanson" title="Gloria Swanson">Gloria Swanson</a></span></td>
<td><i><a href="/wiki/Sadie_Thompson" title="Sadie Thompson">Sadie Thompson</a></i></td>
<td><span style="display:none;" class="sortkey">Thompson !</span><span class="sorttext">Sadie Thompson</span></td>
</tr>
<tr>
<th scope="row" rowspan="6" style="text-align:center"><a href="/wiki/1928_in_film" title="1928 in film">1928</a>/<a href="/wiki/1929_in_film" title="1929 in film">29</a><br />
<small><a href="/wiki/2nd_Academy_Awards" title="2nd Academy Awards">(2nd)</a></small></th>
<td style="background:#FAEB86"><b><span style="display:none;" class="sortkey">Pickford !</span><span class="sorttext">**<a href="/wiki/Mary_Pickford" title="Mary Pickford">Mary Pickford</a>**</span> <img alt="Award winner" src="//upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" width="9" height="14" data-file-width="9" data-file-height="14" /></b></td>
For every table row, I need to grab the first element, then the text inside the first nested tag. Louise Dresser would be the first data point, followed by Gloria Swanson, and then Mary Pickford.
I thought the following would get me there, but I was wrong and 6 hours later I am spent.
def getActresses(URL):
    try:
        html = urlopen(URL)
    except HTTPError:
        print("Page not found.")
        return None
    try:
        bsObj = BeautifulSoup(html, "lxml")
        soup = bsObj.find("table", {"class":"wikitable sortable"})
    except AttributeError:
        print("Error creating/navigating soup object")
    data = soup.find_all("tr").find_all("td").find("a").get_text()
    print(data)

getActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress")
This isn't the only code I've tried. I've tried looping through rows, then table data cells, then accessing a tags. I've tried asking for a tags and then regexing them out, only to be told I couldn't have the text I wanted. The most frequent error I've gotten when trying to chain operations (as above) is AttributeError: 'ResultSet' object has no attribute 'find'.
Subscripting absolutely doesn't work, even when replicating book examples (go fig?!). Also, I've had processes abort themselves, which I didn't know was possible.
Thoughts on what's going on and why something that should be so simple seems to be such an event would be enormously appreciated.
Beautiful Soup provides the find() and find_all() methods for pulling specific elements out of an HTML document by tag name or attributes. find() returns the first matching element; find_all() returns all matching elements.
The find() method looks up the first tag with the specified name or attributes (such as an id) and returns it as a single bs4 Tag object, or None if nothing matches.
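To make the difference between the two methods concrete, here is a minimal sketch. The two-row table is invented for illustration, and it uses the built-in "html.parser" rather than "lxml" only to avoid an extra dependency. It shows that chaining works on the Tag returned by find(), but not on the ResultSet returned by find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, invented for illustration only.
html = """
<table>
  <tr><td><a href="/a">Alpha</a></td><td><a href="/b">Beta</a></td></tr>
  <tr><td><a href="/c">Gamma</a></td><td>plain text</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("tr")      # a single Tag (or None if nothing matches)
all_rows = soup.find_all("tr")   # a ResultSet, which behaves like a list

# Chaining works on a Tag, because a Tag has its own find()/find_all().
first_link_text = first_row.find("td").find("a").get_text()

# Chaining on a ResultSet fails: the list itself has no find().
try:
    all_rows.find("td")
    chained_ok = True
except AttributeError:
    chained_ok = False

# Subscript the ResultSet to get back a Tag before calling find() again.
second_row_text = all_rows[1].find("a").get_text()

print(first_link_text, chained_ok, second_row_text)
```

The AttributeError caught here is the same kind of error reported in the question: a subscript or a loop is needed to get from the ResultSet back to an individual Tag.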
import requests
from bs4 import BeautifulSoup

def getActresses(URL):
    res = requests.get(URL)
    try:
        soup = BeautifulSoup(res.content, "lxml")
        table = soup.find("table", {"class":"wikitable sortable"})
    except AttributeError:
        print("Error creating/navigating soup object")
    tr = table.find_all("tr")
    for _tr in tr:
        td = _tr.find_all("td")
        for _td in td:
            a = _td.find_all("a")
            for _a in a:
                print(_a.text.encode("utf-8"))

getActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress")
You can use the .text attribute instead of get_text(). Sorry, I used the requests module to demonstrate. The find_all method always returns a list (a ResultSet), so you have to loop through it.
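If only the first link in the first data cell of each row is wanted (the actress name in the question), one possible variation is to call find() inside the row loop instead of find_all(). The sketch below is an assumption-laden illustration, not a tested scraper for the live page: the helper name first_link_per_row is hypothetical, the sample rows are simplified from the snippet in the question, and "html.parser" is used so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Simplified rows modeled on the table in the question (attributes trimmed).
SAMPLE = """
<table class="wikitable sortable">
<tr>
<td><span class="sorttext"><a href="/wiki/Louise_Dresser">Louise Dresser</a></span></td>
<td><i><a href="/wiki/A_Ship_Comes_In">A Ship Comes In</a></i></td>
</tr>
<tr>
<td><span class="sorttext"><a href="/wiki/Gloria_Swanson">Gloria Swanson</a></span></td>
<td><i><a href="/wiki/Sadie_Thompson">Sadie Thompson</a></i></td>
</tr>
<tr>
<th scope="row"><a href="/wiki/1928_in_film">1928</a></th>
<td><b><a href="/wiki/Mary_Pickford">Mary Pickford</a></b></td>
</tr>
</table>
"""


def first_link_per_row(html):
    """Return the text of the first <a> inside the first <td> of each row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:          # find() returns None when nothing matches
        return []
    names = []
    for row in table.find_all("tr"):   # loop over the ResultSet
        cell = row.find("td")          # first data cell; <th> cells are skipped
        if cell is None:
            continue
        link = cell.find("a")          # first link nested anywhere in the cell
        if link is not None:
            names.append(link.get_text())
    return names


names = first_link_per_row(SAMPLE)
print(names)
```

The None checks matter because header rows and cells without links make find() return None, and calling a method on None raises an AttributeError. The real page may be laid out differently from these sample rows, so the selection logic may need adjusting.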