
Parse HTML with Beautiful Soup. Return text from specific tag

I can retrieve a complete HTML tag, selected by one of its attributes, with a Python script that I run from a Unix shell:

#!/usr/bin/python3

# import the module
from bs4 import BeautifulSoup

# define your object
soup = BeautifulSoup(open("test.html"), "html.parser")

# get the tag
print(soup(itemprop="name"))

where itemprop="name" uniquely identifies the required tag.

The output is something like:

[<span itemprop="name">
                    Blabla &amp; Bloblo</span>]

Now I would like to return only the text inside the tag (Blabla & Bloblo).

My attempt was:

print(soup(itemprop="name").getText())

but I get an error: AttributeError: 'ResultSet' object has no attribute 'getText'

The same method did work when I tried it in other contexts, such as:

print(soup.find('span').getText())

So what am I getting wrong?

asked Aug 12 '14 15:08 by joaoal


1 Answer

Calling the soup object as a function returns a list of results (a ResultSet), as if you had used soup.find_all(). A list has no getText() method; only the individual tags in it do. See the documentation:

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object.
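
As a rough illustration of that shortcut, the sketch below compares the two forms side by side. The inline HTML string is a made-up stand-in for test.html, and "html.parser" is just one possible parser choice, not something the question specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for test.html (not the asker's real file)
html = (
    '<div><span itemprop="name">\n'
    '    Blabla &amp; Bloblo</span>'
    '<span itemprop="price">42</span></div>'
)
soup = BeautifulSoup(html, "html.parser")

called = soup(itemprop="name")             # shortcut: call the soup object
searched = soup.find_all(itemprop="name")  # explicit find_all()

# Both forms give the same list-like result with one matching tag
print(isinstance(called, list), len(called))
print(called == searched)

# The list itself has no get_text(); only the tags inside it do
print(hasattr(called, "get_text"))
print(hasattr(called[0], "get_text"))
```

Since the result is a list even when only one tag matches, any Tag method has to be called on an element of it, not on the list.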

Use soup.find() to find just the first match:

soup.find(itemprop="name").get_text()

or index into the result set:

soup(itemprop="name")[0].get_text()
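
Putting both variants together, here is a minimal self-contained sketch. It assumes an inline HTML string in place of test.html and the "html.parser" parser. The text_of helper is hypothetical and not part of Beautiful Soup; it only guards against find() returning None when nothing matches. Passing strip=True to get_text() trims the leading newline and indentation seen in the question's output.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the output shown in the question
html = """<html><body>
<span itemprop="name">
                    Blabla &amp; Bloblo</span>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")


def text_of(soup, **attrs):
    """Hypothetical helper: stripped text of the first match, or None."""
    tag = soup.find(**attrs)  # find() returns None when nothing matches
    return tag.get_text(strip=True) if tag is not None else None


first = soup.find(itemprop="name").get_text(strip=True)   # first match only
indexed = soup(itemprop="name")[0].get_text(strip=True)   # index the result list
name = text_of(soup, itemprop="name")
missing = text_of(soup, itemprop="price")                 # no such tag here

print(first)    # Blabla & Bloblo
print(missing)  # None
```

The find() form is usually the safer of the two when a match is not guaranteed, because indexing an empty result list raises IndexError while find() returns None that can be checked.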
answered Oct 16 '22 18:10 by Martijn Pieters