 

BeautifulSoup: find an element by text using `find_all`, even when it contains child elements

For example

import re
from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><a>sometext</a></html>")
print(bs.find_all("a", text=re.compile(r"some")))

returns [<a>sometext</a>], but when the element being searched for has a child element, e.g. an img,

bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
print(bs.find_all("a", text=re.compile(r"some")))

it returns []

Is there a way to use find_all to match the latter example?

Bula asked Apr 18 '13 18:04





1 Answer

You will need to use a hybrid approach since text= will fail when an element has child elements as well as text.

bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
reg = re.compile(r'some')
# Filter on each tag's full text; search() matches anywhere, as text= does with a regex
elements = [e for e in bs.find_all('a') if reg.search(e.text)]
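If you would rather keep everything inside a single find_all call, one option is to pass a function as the filter, since find_all accepts a callable that takes a tag and returns True or False. This is only a minimal sketch: the helper name a_with_matching_text, the sample HTML, and the explicit "html.parser" argument are my own choices for illustration, not part of the question.

```python
import re
from bs4 import BeautifulSoup

# Sample markup (assumed for illustration): one <a> with a child <img>, one without a match
html = "<html><a>sometext<img /></a><a>other</a></html>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"some")


def a_with_matching_text(tag):
    """Return True for <a> tags whose combined text matches the pattern."""
    # get_text() joins all descendant strings, so child tags do not hide the text
    return tag.name == "a" and pattern.search(tag.get_text()) is not None


matches = soup.find_all(a_with_matching_text)
print(matches)
```

The function is called once per tag in the document, so it is equivalent to the list comprehension but stays within the find_all API.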

Background

When BeautifulSoup searches for an element and a text filter is supplied (here, a compiled regex), it eventually calls:

self._matches(found.string, self.text)

In the two examples you gave, the .string property returns different things:

>>> bs1 = BeautifulSoup("<html><a>sometext</a></html>")
>>> bs1.find('a').string
u'sometext'
>>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>")
>>> bs2.find('a').string
>>> print bs2.find('a').string
None

The .string property looks like this:

@property
def string(self):
    """Convenience property to get the single string within this tag.

    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string

If we print out the contents we can see why this returns None:

>>> print bs1.find('a').contents
[u'sometext']
>>> print bs2.find('a').contents
[u'sometext', <img/>]
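To see the difference side by side, here is a small sketch (written for Python 3 with bs4, which is an assumption on my part since the session above is Python 2) that compares .string with get_text() for both documents. The describe helper is hypothetical and exists only for this comparison.

```python
from bs4 import BeautifulSoup


def describe(html):
    """Summarise the first <a> tag: child count, .string, and get_text()."""
    a = BeautifulSoup(html, "html.parser").find("a")
    return {
        "children": len(a.contents),  # direct children: strings and tags
        "string": a.string,           # None unless there is exactly one child
        "text": a.get_text(),         # all descendant strings joined together
    }


plain = describe("<html><a>sometext</a></html>")
with_img = describe("<html><a>sometext<img /></a></html>")

print(plain)
print(with_img)
```

With the extra img child, .string has nothing unambiguous to return, while get_text() still yields the text, which is why filtering on the text yourself works.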
Nathan Villaescusa answered Nov 26 '22 00:11
