BeautifulSoup - search by text inside a tag

Tags: python, regex, beautifulsoup

Observe the following problem:

    import re
    from bs4 import BeautifulSoup as BS

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
        Edit
    </a>
    """)

    # This returns the <a> element
    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*")
    )

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
        <i class="fa fa-edit"></i> Edit
    </a>
    """)

    # This returns None
    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*")
    )

For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and showing its text produces:

    >>> a2 = soup.find(
    ...     'a',
    ...     href="/customer-menu/1/accounts/1/update"
    ... )
    >>> print(repr(a2.text))
    '\n Edit\n'

Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:

    pattern = re.compile('.*Edit.*')
    pattern.match('\n Edit\n')  # Returns None

    pattern = re.compile('.*Edit.*', flags=re.DOTALL)
    pattern.match('\n Edit\n')  # Returns MatchObject
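As a side check of the regex behaviour on its own, here is a minimal sketch that uses only the standard re module. The variable names are made up for illustration. It also shows that search() with a bare pattern finds the word without wildcards or flags, which may be worth trying regardless of which method Beautiful Soup calls internally.

```python
import re

text = "\n Edit\n"  # the same string that a2.text printed above

# match() only tries at position 0, and '.' does not match '\n' by default,
# so the leading newline makes this fail.
no_flag = re.compile(".*Edit.*").match(text)

# With DOTALL, '.' also matches '\n', so the leading newline is consumed.
with_dotall = re.compile(".*Edit.*", flags=re.DOTALL).match(text)

# search() scans the whole string, so no wildcards or flags are needed.
searched = re.compile("Edit").search(text)

print(no_flag)
print(with_dotall is not None)
print(searched.group(0))
```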

Alright. Looks good. Let's try it with soup

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
        <i class="fa fa-edit"></i> Edit
    </a>
    """)

    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*", flags=re.DOTALL)
    )  # Still returns None... Why?!

Edit

My solution, based on geckon's answer: I implemented these helpers:

    import re

    MATCH_ALL = r'.*'


    def like(string):
        """
        Return a compiled regular expression that matches the given
        string with any prefix and postfix, e.g. if string = "hello",
        the returned regex matches r".*hello.*"
        """
        string_ = string
        if not isinstance(string_, str):
            string_ = str(string_)
        regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
        return re.compile(regex, flags=re.DOTALL)


    def find_by_text(soup, text, tag, **kwargs):
        """
        Find the tag in soup that matches all provided kwargs, and contains the
        text.

        If no match is found, return None.
        If more than one match is found, raise ValueError.
        """
        elements = soup.find_all(tag, **kwargs)
        matches = []
        for element in elements:
            if element.find(text=like(text)):
                matches.append(element)
        if len(matches) > 1:
            # The matches are Tag objects, so convert them before joining
            raise ValueError(
                "Too many matches:\n" + "\n".join(str(m) for m in matches)
            )
        elif len(matches) == 0:
            return None
        else:
            return matches[0]

Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')

asked Aug 12 '15 07:08

Eldamir

2 Answers

The problem is that your <a> tag, with the <i> tag inside it, doesn't have the string attribute you expect it to have. First, let's take a look at what the text="" argument for find() does.

NOTE: text is the old name of this argument; since BeautifulSoup 4.4.0 it's called string.

From the docs:

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is “Elsie”:
    soup.find_all("a", string="Elsie")
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

Now let's take a look at what Tag's string attribute is (from the docs again):

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
    title_tag.string
    # u'The Dormouse's story'

(...)

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:
    print(soup.html.string)
    # None

This is exactly your case. Your <a> tag contains both text and an <i> tag. Therefore, find() gets None when it looks up .string to match against, and thus it can't match.
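To see the difference directly, here is a small sketch that compares .string on both variants of the link. The shortened href and the variable names are placeholders for illustration, and the "html.parser" argument is just an explicit choice of the built-in parser.

```python
from bs4 import BeautifulSoup

# Link whose only child is a text node
plain = BeautifulSoup('<a href="/x"> Edit </a>', "html.parser").a

# Link with two children: an <i> tag and a text node
nested = BeautifulSoup(
    '<a href="/x"><i class="fa fa-edit"></i> Edit </a>', "html.parser"
).a

print(repr(plain.string))       # the single text child
print(repr(nested.string))      # None, because there is more than one child
print(repr(nested.get_text()))  # the combined text is still available
```

So the text is there in both cases; only the .string shortcut, which the text/string filter relies on, is missing in the nested one.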

How to solve this?

Maybe there is a better solution but I would probably go with something like this:

    import re
    from bs4 import BeautifulSoup as BS

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
        <i class="fa fa-edit"></i> Edit
    </a>
    """)

    links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")

    thelink = None  # stays None if no link contains the text
    for link in links:
        # Search the link's descendants for a text node containing "Edit"
        if link.find(text=re.compile("Edit")):
            thelink = link
            break

    print(thelink)

I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.
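Another option, offered as a sketch rather than a definitive fix, is to go the other way round: find the text node first and then walk up to the enclosing link with find_parent(). This assumes the first text node containing "Edit" is the one inside the link you want; if the word can appear elsewhere on the page, looping over find_all(string=...) would be safer.

```python
import re
from bs4 import BeautifulSoup

html = """
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Without a tag name, find(string=...) returns the text node itself
text_node = soup.find(string=re.compile("Edit"))

link = None
if text_node is not None:
    # Walk up from the text node to the nearest matching <a> ancestor
    link = text_node.find_parent(
        "a", href="/customer-menu/1/accounts/1/update"
    )

print(link)
```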

answered Oct 15 '22 22:10

geckon

In one line, using a lambda:

    soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)
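If the page has several links, the same idea can be written as a named predicate that also checks the href. This is only a sketch: the function name is_edit_link and the second "Delete" link are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<a href="/customer-menu/1/accounts/1/delete">
    <i class="fa fa-trash"></i> Delete
</a>
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
"""
soup = BeautifulSoup(html, "html.parser")


def is_edit_link(tag):
    """Return True for the <a> tag with the wanted href and visible text."""
    return (
        tag.name == "a"
        and tag.get("href") == "/customer-menu/1/accounts/1/update"
        and "Edit" in tag.get_text()
    )


# find() calls the predicate for each tag and returns the first match
link = soup.find(is_edit_link)
print(link["href"])
```

Unlike the text/string filter, get_text() collects the text of all descendants, so the nested <i> tag does not get in the way.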

answered Oct 15 '22 22:10

Amr
