I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}
<h2> this is cool #12345678901 </h2>
So, the previous would match by using:
soup('h2',text=re.compile(r' #\S{11}'))
And the results would be something like:
[u'blahblah #223409823523', u'thisisinteresting #293845023984']
I'm able to get all the text that matches (see line above). But I want the parent element of the text to match, so I can use that as a starting point for traversing the document tree. In this case, I'd want all the h2 elements to return, not the text matches.
Ideas?
Approach: Here we first import the regular expressions and BeautifulSoup libraries. Then we open the HTML file using the open function which we want to parse. Then using the find_all function, we find a particular tag that we pass inside that function and also the text we want to have within the tag.
from BeautifulSoup import BeautifulSoup
import re
html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""
soup = BeautifulSoup(html_text)
for elem in soup(text=re.compile(r' #\S{11}')):
print elem.parent
Prints:
<h2>this is cool #12345678901</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
BeautifulSoup search operations deliver [a list of] BeautifulSoup.NavigableString
objects when text=
is used as a criteria as opposed to BeautifulSoup.Tag
in other cases. Check the object's __dict__
to see the attributes made available to you. Of these attributes, parent
is favored over previous
because of changes in BS4.
from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re
html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""
soup = BeautifulSoup(html_text)
# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')
pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>> 'nextSibling': None,
#>> 'parent': <h2>this is cool #12345678901</h2>,
#>> 'previous': <h2>this is cool #12345678901</h2>,
#>> 'previousSibling': None}
print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
With bs4 (Beautiful Soup 4), the OP's attempt works exactly like expected:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>")
soup('h2',text=re.compile(r' #\S{11}'))
returns [<h2> this is cool #12345678901 </h2>]
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With