How to find tag with particular text with Beautiful Soup?

People also ask

How do I find my parents tag on BeautifulSoup?

Pass the HTML document into the Beautifulsoup() function. Get the tag or element in the document or HTML. Use the ". parent" function to find out the parent of any tag.

You can pass a regular expression to the text parameter of findAll, like so:

import BeautifulSoup
import re

columns = soup.findAll('td', text = re.compile('your regex here'), attrs = {'class' : 'pos'})

This post got me to my answer even though the answer is missing from this post. I felt I should give back.

The challenge here is in the inconsistent behavior of BeautifulSoup.find when searching with and without text.

Note: If you have BeautifulSoup, you can test this locally via:

curl https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.py | python

Code: https://gist.github.com/4060082

# Taken from https://gist.github.com/4060082
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint
import re

soup = BeautifulSoup(urlopen('https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.html').read())
# I'm going to assume that Peter knew that re.compile is meant to cache a computation result for a performance benefit. However, I'm going to do that explicitly here to be very clear.
pattern = re.compile('Fixed text')

# Peter's suggestion here returns a list of what appear to be strings
columns = soup.findAll('td', text=pattern, attrs={'class' : 'pos'})
# ...but it is actually a BeautifulSoup.NavigableString
print type(columns[0])
#>> <class 'BeautifulSoup.NavigableString'>

# you can reach the tag using one of the convenience attributes seen here
pprint(columns[0].__dict__)
#>> {'next': <br />,
#>>  'nextSibling': <br />,
#>>  'parent': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previous': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previousSibling': None}

# I feel that 'parent' is safer to use than 'previous' based on http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
# So, if you want to find the 'text' in the 'strong' element...
pprint([t.parent.find('strong').text for t in soup.findAll('td', text=pattern, attrs={'class' : 'pos'})])
#>> [u'text I am looking for']

# Here is what we have learned:
print soup.find('strong')
#>> <strong>some value</strong>
print soup.find('strong', text='some value')
#>> u'some value'
print soup.find('strong', text='some value').parent
#>> <strong>some value</strong>
print soup.find('strong', text='some value') == soup.find('strong')
#>> False
print soup.find('strong', text='some value') == soup.find('strong').text
#>> True
print soup.find('strong', text='some value').parent == soup.find('strong')
#>> True

Though it is most certainly too late to help the OP, I hope they will make this as the answer since it does satisfy all quandaries around finding by text.

With bs4 4.7.1+ you can use :contains pseudo class to specify the td containing your (filter) search string. You can then use a descendant child combinator, in this case, to move to the strong containing target text:

from bs4 import BeautifulSoup as bs

html = '''
<tr>
  <td class="pos">\n
      "Some text:"\n
      <br>\n
      <strong>some value</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Fixed text:"\n
      <br>\n
      <strong>text I am looking for</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Some other text:"\n
      <br>\n
      <strong>some other value</strong>\n
  </td>
</tr>'''
soup = bs(html, 'lxml')
print(soup.select_one('td:contains("Fixed text:") strong').text)

soupsieve 2.1.0 onwards:

NEW: In order to avoid conflicts with future CSS specification changes, non-standard pseudo classes will now start with the :-soup- prefix. As a consequence, :contains() will now be known as :-soup-contains(), though for a time the deprecated form of :contains() will still be allowed with a warning that users should migrate over to :-soup-contains().

NEW: Added new non-standard pseudo class :-soup-contains-own() which operates similar to :-soup-contains() except that it only looks at text nodes directly associated with the currently scoped element and not its descendants.

Quote from @facelessuser github page.

Since Beautiful Soup 4.4.0. a parameter called string does the work that text used to do in the previous versions.

string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for the string. This code finds the tags whose .string is “Elsie”:

soup.find_all("td", string="Elsie")

For more information about string have a look this section https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument

A solution for finding a anchor tag if having a particular keyword would be the following:

from bs4 import BeautifulSoup
from urllib.request import urlopen,Request
from urllib.parse import urljoin,urlparse

rawLinks=soup.findAll('a',href=True)
for link in rawLinks:
    innercontent=link.text
    if keyword.lower() in innercontent.lower():
        print(link)

result = soup.find('strong', text='text I am looking for').text

Related questions
                            
                                PIL: DLL load failed: specified procedure could not be found
                            
                                How to keep Unit tests and Integrations tests separate in pytest
                            
                                Increase resolution with word-cloud and remove empty border
                            
                                Python sqlite3 version
                            
                                get path of python binary that's executing the script [duplicate]
                            
                                List of pylint human readable message ids?
                            
                                Add zeros to a float after the decimal point in Python
                            
                                Pandas Dataframe display on a webpage
                            
                                How to install CUDA in Google Colab GPU's
                            
                                How can I check all the installed Python versions on Windows?
                            
                                How to set any font in reportlab Canvas in python?
                            
                                Interactive Python: cannot get `%lprun` to work, although line_profiler is imported properly
                            
                                python pandas loc - filter for list of values [duplicate]
                            
                                How to use virtualenv with python3.6 on ubuntu 16.04?
                            
                                "except Foo as bar" causes "bar" to be removed from scope [duplicate]
                            
                                Multithreading for Python Django
                            
                                How to make a simple multithreaded socket server in Python that remembers clients
                            
                                How to tune parameters in Random Forest, using Scikit Learn?
                            
                                What is the meaning of bind = True keyword in celery?
                            
                                Join float list into space-separated string in Python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to find tag with particular text with Beautiful Soup?

Tags:

python

html

beautifulsoup

web-scraping

People also ask

Recent Activity

Donate For Us