
BeautifulSoup - search by text inside a tag

Observe the following problem:

    import re
    from bs4 import BeautifulSoup as BS

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
        Edit
    </a>
    """)

    # This returns the <a> element
    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*")
    )

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
        <i class="fa fa-edit"></i> Edit
    </a>
    """)

    # This returns None
    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*")
    )

For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and printing its text produces:

    >>> a2 = soup.find(
    ...     'a',
    ...     href="/customer-menu/1/accounts/1/update"
    ... )
    >>> print(repr(a2.text))
    '\n Edit\n'

Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:

    pattern = re.compile('.*Edit.*')
    pattern.match('\n Edit\n')  # Returns None

    pattern = re.compile('.*Edit.*', flags=re.DOTALL)
    pattern.match('\n Edit\n')  # Returns MatchObject

Alright. Looks good. Let's try it with soup:

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
        <i class="fa fa-edit"></i> Edit
    </a>
    """)

    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*", flags=re.DOTALL)
    )  # Still returns None... Why?!

Edit

My solution, based on geckon's answer: I implemented these helpers:

    import re

    MATCH_ALL = r'.*'


    def like(string):
        """
        Return a compiled regular expression that matches the given
        string with any prefix and postfix, e.g. if string = "hello",
        the returned regex matches r".*hello.*"
        """
        string_ = string
        if not isinstance(string_, str):
            string_ = str(string_)
        regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
        return re.compile(regex, flags=re.DOTALL)


    def find_by_text(soup, text, tag, **kwargs):
        """
        Find the tag in soup that matches all provided kwargs, and contains the
        text.

        If no match is found, return None.
        If more than one match is found, raise ValueError.
        """
        elements = soup.find_all(tag, **kwargs)
        matches = []
        for element in elements:
            if element.find(text=like(text)):
                matches.append(element)
        if len(matches) > 1:
            # The matches are Tag objects, so convert them to str before joining.
            raise ValueError(
                "Too many matches:\n" + "\n".join(str(m) for m in matches)
            )
        elif len(matches) == 0:
            return None
        else:
            return matches[0]

Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')

Eldamir asked Aug 12 '15 07:08




2 Answers

The problem is that your <a> tag, with the <i> tag inside it, doesn't have the string attribute you expect it to have. First, let's take a look at what the text argument of find() does.

NOTE: text is the old name of this argument; since Beautiful Soup 4.4.0 it is called string.
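As a minimal sketch of the newer spelling, assuming Beautiful Soup 4.4.0 or later and the built-in "html.parser" backend (both are my assumptions, not something the question specifies), the working search from the question can be written with string= instead of text=:

```python
import re
from bs4 import BeautifulSoup

# The first snippet from the question: the <a> tag has exactly one child,
# a text node, so the filter has a .string to compare against.
html = '<a href="/customer-menu/1/accounts/1/update">\n    Edit\n</a>'
soup = BeautifulSoup(html, "html.parser")

# `string=` is the current name of the older `text=` keyword.
link = soup.find(
    "a",
    href="/customer-menu/1/accounts/1/update",
    string=re.compile("Edit"),
)
print(link)
```

This only changes the keyword name; it does not by itself fix the case where the <a> tag also contains an <i> tag.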

From the docs:

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is “Elsie”:

    soup.find_all("a", string="Elsie")
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

Now let's look at what a Tag's string attribute is (from the docs again):

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

    title_tag.string
    # u'The Dormouse's story'

(...)

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

    print(soup.html.string)
    # None

This is exactly your case. Your <a> tag contains both a text node and an <i> tag. Its .string is therefore None, so find() has nothing to match the pattern against.
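The difference can be checked directly. The following is a small sketch, assuming the built-in "html.parser" backend (my choice here, not something the question specifies), that compares .string and get_text() on the two <a> tags from the question:

```python
from bs4 import BeautifulSoup

HREF = "/customer-menu/1/accounts/1/update"

# <a> with a single text child
plain = BeautifulSoup(
    f'<a href="{HREF}">\n    Edit\n</a>', "html.parser"
).a

# <a> with an <i> child plus text around it
mixed = BeautifulSoup(
    f'<a href="{HREF}">\n    <i class="fa fa-edit"></i> Edit\n</a>',
    "html.parser",
).a

print(repr(plain.string))     # the text node itself
print(repr(mixed.string))     # None, because there is more than one child
print(repr(mixed.get_text())) # all nested text joined together
```

So .text (or get_text()) still contains "Edit" for the second tag, which is why printing a2.text in the question looked fine even though the text filter found nothing.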

How to solve this?

Maybe there is a better solution but I would probably go with something like this:

    import re
    from bs4 import BeautifulSoup as BS

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
        <i class="fa fa-edit"></i> Edit
    </a>
    """)

    links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")

    # Default to None so the print below works even if nothing matches.
    thelink = None
    for link in links:
        # Search the text nodes nested inside each candidate link.
        if link.find(text=re.compile("Edit")):
            thelink = link
            break

    print(thelink)

I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.

geckon answered Oct 15 '22 22:10


In one line, using a lambda:

    soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)
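If the href from the question should be checked as well, the same idea can be extended with a named predicate. This is only a sketch: the second "Delete" link, the is_edit_link name, the TARGET_HREF constant and the "html.parser" backend are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# The "Delete" link is hypothetical, added only so the filter has
# something to reject.
html = """
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
<a href="/customer-menu/1/accounts/1/delete">
    <i class="fa fa-trash"></i> Delete
</a>
"""
soup = BeautifulSoup(html, "html.parser")

TARGET_HREF = "/customer-menu/1/accounts/1/update"


def is_edit_link(tag):
    """Match <a> tags with the target href whose nested text contains 'Edit'."""
    return (
        tag.name == "a"
        and tag.get("href") == TARGET_HREF
        and "Edit" in tag.get_text()
    )


link = soup.find(is_edit_link)
print(link["href"])
```

Because get_text() joins the text of all descendants, this works whether or not the <i> tag is present. As with the lambda, find() returns None when no tag satisfies the predicate.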
Amr answered Oct 15 '22 22:10