Observe the following problem:
import re from bs4 import BeautifulSoup as BS soup = BS(""" <a href="/customer-menu/1/accounts/1/update"> Edit </a> """) # This returns the <a> element soup.find( 'a', href="/customer-menu/1/accounts/1/update", text=re.compile(".*Edit.*") ) soup = BS(""" <a href="/customer-menu/1/accounts/1/update"> <i class="fa fa-edit"></i> Edit </a> """) # This returns None soup.find( 'a', href="/customer-menu/1/accounts/1/update", text=re.compile(".*Edit.*") )
For some reason, BeautifulSoup will not match the text, when the <i>
tag is there as well. Finding the tag and showing its text produces
>>> a2 = soup.find( 'a', href="/customer-menu/1/accounts/1/update" ) >>> print(repr(a2.text)) '\n Edit\n'
Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:
pattern = re.compile('.*Edit.*') pattern.match('\n Edit\n') # Returns None pattern = re.compile('.*Edit.*', flags=re.DOTALL) pattern.match('\n Edit\n') # Returns MatchObject
Alright. Looks good. Let's try it with soup
soup = BS(""" <a href="/customer-menu/1/accounts/1/update"> <i class="fa fa-edit"></i> Edit </a> """) soup.find( 'a', href="/customer-menu/1/accounts/1/update", text=re.compile(".*Edit.*", flags=re.DOTALL) ) # Still return None... Why?!
My solution based on geckons answer: I implemented these helpers:
import re MATCH_ALL = r'.*' def like(string): """ Return a compiled regular expression that matches the given string with any prefix and postfix, e.g. if string = "hello", the returned regex matches r".*hello.*" """ string_ = string if not isinstance(string_, str): string_ = str(string_) regex = MATCH_ALL + re.escape(string_) + MATCH_ALL return re.compile(regex, flags=re.DOTALL) def find_by_text(soup, text, tag, **kwargs): """ Find the tag in soup that matches all provided kwargs, and contains the text. If no match is found, return None. If more than one match is found, raise ValueError. """ elements = soup.find_all(tag, **kwargs) matches = [] for element in elements: if element.find(text=like(text)): matches.append(element) if len(matches) > 1: raise ValueError("Too many matches:\n" + "\n".join(matches)) elif len(matches) == 0: return None else: return matches[0]
Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')
There are many Beautifulsoup methods, which allows us to search a parse tree. The two most common and used methods are find() and find_all(). Before talking about find() and find_all(), let us see some examples of different filters you can pass into these methods.
Approach: Here we first import the regular expressions and BeautifulSoup libraries. Then we open the HTML file using the open function which we want to parse. Then using the find_all function, we find a particular tag that we pass inside that function and also the text we want to have within the tag.
find() method The find method is used for finding out the first tag with the specified name or id and returning an object of type bs4. Example: For instance, consider this simple HTML webpage having different paragraph tags.
The problem is that your <a>
tag with the <i>
tag inside, doesn't have the string
attribute you expect it to have. First let's take a look at what text=""
argument for find()
does.
NOTE: The text
argument is an old name, since BeautifulSoup 4.4.0 it's called string
.
From the docs:
Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is “Elsie”:
soup.find_all("a", string="Elsie") # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
Now let's take a look what Tag
's string
attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
title_tag.string # u'The Dormouse's story'
(...)
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:
print(soup.html.string) # None
This is exactly your case. Your <a>
tag contains a text and <i>
tag. Therefore, the find gets None
when trying to search for a string and thus it can't match.
How to solve this?
Maybe there is a better solution but I would probably go with something like this:
import re from bs4 import BeautifulSoup as BS soup = BS(""" <a href="/customer-menu/1/accounts/1/update"> <i class="fa fa-edit"></i> Edit </a> """) links = soup.find_all('a', href="/customer-menu/1/accounts/1/update") for link in links: if link.find(text=re.compile("Edit")): thelink = link break print(thelink)
I think there are not too many links pointing to /customer-menu/1/accounts/1/update
so it should be fast enough.
in one line using lambda
soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With