Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:
<p class="review_comment">
So, using the simple code as follows,

from bs4 import BeautifulSoup

content = page.read()
soup = BeautifulSoup(content, "html.parser")  # an explicit parser avoids the bs4 warning
results = soup.find_all("p", "review_comment")
I am happily parsing the text that is living here:
<p class="review_comment">
This place is terrible!</p>
The bad news is that for every 30 or so matches, soup.find_all also grabs something that I really don't want: a user's old review that they've since updated:
<p class="review_comment">
It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more »</a></p>
In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas: tweaking the soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more »</a> tag, and filtering out any match that carries the class="show-archived" attribute. Any ideas would be gratefully appreciated. Thanks in advance.
You can use extract() to remove the unwanted tag before you get the text. But extract() keeps all the '\n' characters and spaces, so you will need some extra work to remove them. Alternatively, you can skip every Tag object inside the outer element and keep only the NavigableString objects (the plain text in the HTML). extract() works, but only if you have a single unwanted tag.
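For instance, a minimal sketch of both approaches, assuming the soup built in the question:

from bs4 import NavigableString

for p in soup.find_all("p", "review_comment"):
    # extract() approach: remove the unwanted <a> tag in place, then read the text.
    link = p.find("a", class_="show-archived")
    if link is not None:
        link.extract()
    cleaned = p.get_text(strip=True)  # strip=True trims the leftover '\n' and spaces

    # NavigableString approach: keep only the plain-text children, skipping every Tag.
    plain = "".join(child for child in p.children
                    if isinstance(child, NavigableString)).strip()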
Beautiful Soup provides the find() and find_all() functions to get specific data from an HTML file by passing the tag you are looking for. find() returns the first element matching the given tag, while find_all() returns all the matching elements after scanning the entire document.
To find elements by class in Beautiful Soup, use the find_all() or select() method.
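For example, a short sketch of those calls against the soup from the question:

first = soup.find("p", class_="review_comment")      # first matching <p>, or None
every = soup.find_all("p", class_="review_comment")  # list of all matching <p> tags
same = soup.select("p.review_comment")               # CSS-selector equivalent of the find_all call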
Is this what you are seeking?
for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
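As a usage note, the same filter can be written as a list comprehension that collects only the wanted review texts (a sketch reusing the names from the question):

wanted = [p.get_text(strip=True)
          for p in soup.find_all("p", "review_comment")
          if not p.find(class_="show-archived")]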