Using BeautifulSoup to find an HTML tag that contains certain text

Tags:

python

regex

beautifulsoup

html-content-extraction

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

So, the previous would match by using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

I'm able to get all the text that matches (see line above). But I want the parent element of the text to match, so I can use that as a starting point for traversing the document tree. In this case, I'd want all the h2 elements to return, not the text matches.

Ideas?

457

asked May 14 '09 21:05

sotangochips

3 Answers

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints (the h1 appears too, because this search is not restricted to h2):

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
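
For readers on Python 3, the same idea can be sketched with the bs4 package. This is only a sketch under a few assumptions: bs4 4.4.0 or later is installed, the built-in "html.parser" is used, and the names `parents` and `h2_parents` are mine, not part of the original answer. It uses `string=`, which is the bs4 spelling of the `text=` argument.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# A string-only search yields NavigableString objects;
# .parent steps up to the tag that encloses each one.
parents = [s.parent for s in soup.find_all(string=pattern)]

# The string search is not tied to a tag name, so it also hits the h1.
# Keep only the h2 elements, which is what the question asks for.
h2_parents = [tag for tag in parents if tag.name == "h2"]

for tag in h2_parents:
    print(tag)
```

Filtering on `tag.name` afterwards keeps the regex search and the tag restriction as two separate, easy-to-read steps.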

174

answered Oct 17 '22 15:10

nosklo

When text= is used as a criterion, BeautifulSoup search operations return BeautifulSoup.NavigableString objects (or a list of them) rather than the BeautifulSoup.Tag objects returned in other cases. Check the object's __dict__ to see which attributes are available to you. Of these attributes, parent is a safer choice than previous because of changes in BS4: parent keeps its name there, while previous was renamed previous_element.

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
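
As a rough counterpart for bs4, the sketch below checks which object type each kind of search returns. It assumes bs4 4.4.0 or later with the built-in "html.parser"; the variable names `text_hit` and `tag_hit` are illustrative only.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
pattern = re.compile(r"cool")

# Searching by string alone returns the matching text node.
text_hit = soup.find(string=pattern)

# Combining a tag name with a string filter returns the tag itself in bs4.
tag_hit = soup.find("h2", string=pattern)

print(isinstance(text_hit, NavigableString))  # True
print(isinstance(tag_hit, Tag))               # True

# .parent of the text node is the very same h2 object.
print(text_hit.parent is tag_hit)             # True

# bs4 renamed BS3's .previous to .previous_element.
print(text_hit.previous_element is tag_hit)   # True
```

So in bs4 the `.parent` step is only needed when the search is done by string alone.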

answered Oct 17 '22 13:10

Bruno Bronosky

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2', text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
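
One caveat worth sketching: in bs4 the string filter is compared against a tag's `.string`, which is None when the tag has more than one child, so an h2 with nested markup is not matched. A possible workaround, shown here under the assumption of bs4 4.4.0+ and "html.parser", is to pass a function that tests the tag's full text. The helper name `h2_containing_pattern` and the sample HTML are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical input: the first h2 contains a nested <b> element.
html_text = """
<h2>this is <b>cool</b> #12345678901</h2>
<h2>this is nothing</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")


def h2_containing_pattern(tag):
    """Return True for h2 tags whose combined text matches the pattern."""
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None


# The string filter looks at tag.string, which is None for the first h2.
direct = soup.find_all("h2", string=pattern)

# The function filter sees the text of all descendants joined together.
by_text = soup.find_all(h2_containing_pattern)

print(direct)
print(by_text)
```

The function-based filter costs a `get_text()` call per tag, which is usually acceptable for documents of ordinary size.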

answered Oct 17 '22 13:10

T.C. Proctor
