For example,
import re
from bs4 import BeautifulSoup
bs = BeautifulSoup("<html><a>sometext</a></html>")
print bs.find_all("a",text=re.compile(r"some"))
returns [<a>sometext</a>]
but when the element searched for has a child, e.g. an img,
bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
print bs.find_all("a",text=re.compile(r"some"))
it returns []
Is there a way to use find_all to match the latter example?
find returns only the first tag in the parsed HTML that satisfies the condition, or None if nothing matches. find_all scans the entire document and returns all the matches.
find_all returns a ResultSet, a list-like object that offers index-based access to the matches and can be iterated over with a for loop.
BeautifulSoup has a built-in method for extracting the text of an element: get_text(). To use it, call the method on any Tag or BeautifulSoup object. In older versions of Beautiful Soup, get_text() is not available on NavigableString, because the object itself already represents a string.
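As a rough illustration of the difference between find, find_all and get_text(), here is a minimal sketch. It assumes Python 3 and bs4 with the standard-library "html.parser" backend; the sample markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two links, the second one with a child tag.
html = "<html><a>first</a><a>second<img /></a></html>"
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")      # a single Tag, or None if nothing matches
all_links = soup.find_all("a")   # a list-like ResultSet with every match

first_text = first_link.get_text()
all_texts = [link.get_text() for link in all_links]

print(first_text)        # first
print(len(all_links))    # 2
print(all_texts)         # ['first', 'second']
```

Note that get_text() collects the text of the second link even though that link also contains an img child.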
You will need to use a hybrid approach, since text= will fail when an element has child elements as well as text.
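import re
from bs4 import BeautifulSoup
bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
reg = re.compile(r'some')
# search() matches anywhere in the text; match() would only match at the start
elements = [e for e in bs.find_all('a') if reg.search(e.text)]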
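The same hybrid idea can also be expressed inside find_all itself, since find_all accepts a function as a filter. The following is only a sketch, assuming Python 3 and bs4 with the "html.parser" backend; the helper name link_with_matching_text is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
reg = re.compile(r"some")

def link_with_matching_text(tag):
    """Hypothetical filter: an <a> tag whose combined text matches the regex."""
    # get_text() joins every string inside the tag, so child tags such as
    # <img> do not prevent the match the way text= does.
    return tag.name == "a" and reg.search(tag.get_text()) is not None

elements = bs.find_all(link_with_matching_text)
print(elements)
```

The function is called once per tag in the document, so the tag.name check keeps the regex from being run against unrelated elements.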
When BeautifulSoup is searching for an element, and text is a callable, it eventually calls:
self._matches(found.string, self.text)
In the two examples you gave, the .string property returns different things:
>>> bs1 = BeautifulSoup("<html><a>sometext</a></html>")
>>> bs1.find('a').string
u'sometext'
>>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>")
>>> bs2.find('a').string
>>> print bs2.find('a').string
None
The .string property looks like this:
@property
def string(self):
    """Convenience property to get the single string within this tag.

    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string
If we print out the contents, we can see why this returns None:
>>> print bs1.find('a').contents
[u'sometext']
>>> print bs2.find('a').contents
[u'sometext', <img/>]
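To see each case described in the .string docstring side by side, here is a small sketch. It assumes Python 3 and bs4 with the "html.parser" backend; the helper string_of_a and the extra markup samples are invented for the example.

```python
from bs4 import BeautifulSoup

def string_of_a(markup):
    """Hypothetical helper: parse the markup and return .string of its first <a>."""
    return BeautifulSoup(markup, "html.parser").find("a").string

single_string = string_of_a("<a>sometext</a>")         # one string child
nested_tag = string_of_a("<a><b>sometext</b></a>")     # one child tag, resolved recursively
text_and_tag = string_of_a("<a>sometext<img /></a>")   # two children
no_children = string_of_a("<a></a>")                   # no children

print(single_string)   # sometext
print(nested_tag)      # sometext
print(text_and_tag)    # None
print(no_children)     # None
```

Only the third case corresponds to the failing search in the question: the a tag has two children, so .string is None and there is nothing for the text filter to match.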