BeautifulSoup, extracting strings within HTML tags, ResultSet objects

I am confused about exactly how to use the ResultSet object in BeautifulSoup, i.e. bs4.element.ResultSet.

After using find_all(), how can one extract text?

Example:

In the bs4 documentation, the HTML document html_doc looks like:

<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>

One begins by creating the soup and finding all <a> tags:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')

which outputs

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We could also do

for link in soup.find_all('a'):
    print(link.get('href'))

which outputs

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

I would like to get only the text of the tags with class "sister", i.e.

Elsie
Lacie
Tillie

One could try

for link in soup.find_all('a'):
    print(link.get_text())

but this results in an error:

AttributeError: 'ResultSet' object has no attribute 'get_text'
ShanZhengYang asked Jan 08 '23 06:01

1 Answer

Do a find_all() filtering on class_='sister'.

Note the underscore after class. It is needed because class is a reserved word in Python.

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
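As a minimal sketch of the class_ keyword in use, the snippet below searches a small made-up fragment three ways: with class_, with an attrs dictionary, and with a CSS selector through select(). The fragment, including the "brother" link, is an assumption for illustration and is not part of the bs4 documentation example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
html = '<p><a class="sister" href="/elsie">Elsie</a> <a class="brother" href="/tom">Tom</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# Keyword form: class_ with a trailing underscore, because class is reserved.
by_keyword = soup.find_all('a', class_='sister')

# Dictionary form: the attribute name is an ordinary string key.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

# CSS selector form.
by_selector = soup.select('a.sister')

names = [tag.get_text() for tag in by_keyword]
print(names)  # ['Elsie']
```

All three calls should return the same single tag here; which form to use is largely a matter of taste.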

Once you have all of the tags with class sister, call .text on each one to get its text. Be sure to strip the surrounding whitespace with .strip(), since each name sits on its own line inside its tag.

For example:

from bs4 import BeautifulSoup

html_doc = '''<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>'''

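soup = BeautifulSoup(html_doc, 'html.parser')
# Select every tag whose class attribute includes "sister"
sistertags = soup.find_all(class_='sister')
for tag in sistertags:
    # .text is the tag's text content; strip() trims the surrounding whitespace
    print(tag.text.strip())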

Output:

(bs4)macbook:bs4 joeyoung$ python bs4demo.py
Elsie
Lacie
Tillie
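The AttributeError quoted in the question is what you get when a Tag method such as get_text() is called on the ResultSet itself instead of on the tags inside it. A ResultSet is a list of tags, so the method has to be called on each element. The following is a small sketch of that difference, using a shortened, made-up fragment rather than the full html_doc:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical fragment with whitespace around each name.
html = '<p><a class="sister">\n Elsie\n</a>, <a class="sister">\n Lacie\n</a></p>'
soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all('a')

# A ResultSet is a list subclass, so it supports len(), indexing, and iteration.
is_list = isinstance(results, list)

# Calling a Tag method on the ResultSet itself raises AttributeError.
try:
    results.get_text()
    error = None
except AttributeError as exc:
    error = type(exc).__name__

# Call the method on each Tag instead; strip=True trims surrounding whitespace.
texts = [tag.get_text(strip=True) for tag in results]
print(is_list, error, texts)  # True AttributeError ['Elsie', 'Lacie']
```

Here get_text(strip=True) is used as an alternative to .text.strip(); for tags that contain a single string, the two should give the same result.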
Joe Young answered Jan 23 '23 16:01
