Pass the HTML document into the Beautifulsoup() function. Get the tag or element in the document or HTML. Use the ". parent" function to find out the parent of any tag.
You can pass a regular expression to the text parameter of findAll
, like so:
import BeautifulSoup
import re
columns = soup.findAll('td', text = re.compile('your regex here'), attrs = {'class' : 'pos'})
This post got me to my answer even though the answer is missing from this post. I felt I should give back.
The challenge here is in the inconsistent behavior of BeautifulSoup.find
when searching with and without text.
Note: If you have BeautifulSoup, you can test this locally via:
curl https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.py | python
Code: https://gist.github.com/4060082
# Taken from https://gist.github.com/4060082
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint
import re
soup = BeautifulSoup(urlopen('https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.html').read())
# I'm going to assume that Peter knew that re.compile is meant to cache a computation result for a performance benefit. However, I'm going to do that explicitly here to be very clear.
pattern = re.compile('Fixed text')
# Peter's suggestion here returns a list of what appear to be strings
columns = soup.findAll('td', text=pattern, attrs={'class' : 'pos'})
# ...but it is actually a BeautifulSoup.NavigableString
print type(columns[0])
#>> <class 'BeautifulSoup.NavigableString'>
# you can reach the tag using one of the convenience attributes seen here
pprint(columns[0].__dict__)
#>> {'next': <br />,
#>> 'nextSibling': <br />,
#>> 'parent': <td class="pos">\n
#>> "Fixed text:"\n
#>> <br />\n
#>> <strong>text I am looking for</strong>\n
#>> </td>,
#>> 'previous': <td class="pos">\n
#>> "Fixed text:"\n
#>> <br />\n
#>> <strong>text I am looking for</strong>\n
#>> </td>,
#>> 'previousSibling': None}
# I feel that 'parent' is safer to use than 'previous' based on http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
# So, if you want to find the 'text' in the 'strong' element...
pprint([t.parent.find('strong').text for t in soup.findAll('td', text=pattern, attrs={'class' : 'pos'})])
#>> [u'text I am looking for']
# Here is what we have learned:
print soup.find('strong')
#>> <strong>some value</strong>
print soup.find('strong', text='some value')
#>> u'some value'
print soup.find('strong', text='some value').parent
#>> <strong>some value</strong>
print soup.find('strong', text='some value') == soup.find('strong')
#>> False
print soup.find('strong', text='some value') == soup.find('strong').text
#>> True
print soup.find('strong', text='some value').parent == soup.find('strong')
#>> True
Though it is most certainly too late to help the OP, I hope they will make this as the answer since it does satisfy all quandaries around finding by text.
With bs4 4.7.1+ you can use :contains pseudo class to specify the td
containing your (filter) search string. You can then use a descendant child combinator, in this case, to move to the strong
containing target text:
from bs4 import BeautifulSoup as bs
html = '''
<tr>
<td class="pos">\n
"Some text:"\n
<br>\n
<strong>some value</strong>\n
</td>
</tr>
<tr>
<td class="pos">\n
"Fixed text:"\n
<br>\n
<strong>text I am looking for</strong>\n
</td>
</tr>
<tr>
<td class="pos">\n
"Some other text:"\n
<br>\n
<strong>some other value</strong>\n
</td>
</tr>'''
soup = bs(html, 'lxml')
print(soup.select_one('td:contains("Fixed text:") strong').text)
soupsieve 2.1.0 onwards:
NEW: In order to avoid conflicts with future CSS specification changes, non-standard pseudo classes will now start with the :-soup- prefix. As a consequence, :contains() will now be known as :-soup-contains(), though for a time the deprecated form of :contains() will still be allowed with a warning that users should migrate over to :-soup-contains().
NEW: Added new non-standard pseudo class :-soup-contains-own() which operates similar to :-soup-contains() except that it only looks at text nodes directly associated with the currently scoped element and not its descendants.
Quote from @facelessuser github page.
Since Beautiful Soup 4.4.0.
a parameter called string
does the work that text
used to do in the previous versions.
string
is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for the string. This code finds the tags whose .string is “Elsie”:
soup.find_all("td", string="Elsie")
For more information about string have a look this section https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument
A solution for finding a anchor tag if having a particular keyword would be the following:
from bs4 import BeautifulSoup
from urllib.request import urlopen,Request
from urllib.parse import urljoin,urlparse
rawLinks=soup.findAll('a',href=True)
for link in rawLinks:
innercontent=link.text
if keyword.lower() in innercontent.lower():
print(link)
result = soup.find('strong', text='text I am looking for').text
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With