I am confused about exactly how to use the ResultSet object in BeautifulSoup, i.e. bs4.element.ResultSet. After calling find_all(), how can one extract the text?
Example:
In the bs4 documentation, the HTML document html_doc looks like:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>
One begins by creating the soup and finding all of the a tags:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')
which outputs
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
We could also do
for link in soup.find_all('a'):
    print(link.get('href'))
which outputs
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
I would like to get only the text from the tags with class_="sister", i.e.
Elsie
Lacie
Tillie
One could try calling get_text() directly on the result of find_all():
soup.find_all('a').get_text()
but this results in an error:
AttributeError: 'ResultSet' object has no attribute 'get_text'
Do a find_all(), filtering on class_='sister'.
Note: The underscore after class is required. It's a special case because class is a reserved word in Python.
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
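As a minimal sketch of the class_ keyword described above, the snippet below filters a elements by CSS class in three equivalent ways. The shortened html_doc and the extra "brother" link are assumptions added purely for illustration; they are not part of the documentation's example.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened document: the "brother" link is invented here
# only to show that the class filter excludes non-matching tags.
html_doc = (
    '<p><a class="sister" href="http://example.com/elsie">Elsie</a> '
    '<a class="brother" href="http://example.com/tom">Tom</a></p>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

# 1. The class_ keyword argument (note the trailing underscore).
by_keyword = soup.find_all('a', class_='sister')

# 2. An attrs dictionary, which avoids the reserved word entirely.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

# 3. A CSS selector.
by_selector = soup.select('a.sister')

names_by_keyword = [tag.get_text(strip=True) for tag in by_keyword]
names_by_attrs = [tag.get_text(strip=True) for tag in by_attrs]
names_by_selector = [tag.get_text(strip=True) for tag in by_selector]

print(names_by_keyword)
print(names_by_attrs)
print(names_by_selector)
```

All three calls should select the same single tag here, so which one to use is mostly a matter of taste.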
Once you have all of the tags with class sister, call .text on each one to get its text. Be sure to strip the surrounding whitespace.
For example:
from bs4 import BeautifulSoup
html_doc = '''<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>'''
soup = BeautifulSoup(html_doc, 'html.parser')
sistertags = soup.find_all(class_='sister')
for tag in sistertags:
    print(tag.text.strip())
Output:
(bs4)macbook:bs4 joeyoung$ python bs4demo.py
Elsie
Lacie
Tillie
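To make the cause of the AttributeError concrete, here is a small sketch under the assumption of a shortened two-link version of html_doc. It shows that the ResultSet returned by find_all() behaves like a list of tags: text methods such as get_text() live on the individual tags, not on the ResultSet itself. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened version of the example document (two links instead of three).
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
sister_tags = soup.find_all(class_='sister')

# A ResultSet is a list subclass, so it can be iterated, indexed and sliced.
is_list = isinstance(sister_tags, list)

# Extract the text from each tag; strip=True trims the surrounding newlines.
names = [tag.get_text(strip=True) for tag in sister_tags]

# Calling get_text() on the ResultSet itself raises AttributeError.
try:
    sister_tags.get_text()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(is_list)
print(names)
print(error_message)
```

If only a single tag is needed, find() returns that tag directly rather than a ResultSet, so get_text() can be called on its result without a loop.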