
Excluding unwanted results of findAll using BeautifulSoup

Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:

<p class="review_comment">

So, using the simple code as follows,

from bs4 import BeautifulSoup

content = page.read()
soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly
results = soup.find_all("p", "review_comment")

I am happily parsing the text that is living here:

<p class="review_comment">
    This place is terrible!</p>

The bad news is that roughly once in every 30 matches, soup.find_all also grabs something I really don't want, namely a user's old review that they've since updated:

<p class="review_comment">
    It's 1999, and I will always love this place…  
<a href="#" class="show-archived">Read more &raquo;</a></p>

In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas.

  • I've tried altering the arguments in my soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more &raquo;</a> link.
  • I've drowned in regular-expression matching limbo, with no success.
  • I can't seem to take advantage of the class="show-archived" attribute.

Any ideas would be greatly appreciated. Thanks in advance.

asked Oct 13 '13 23:10 by tumultous_rooster


People also ask

How do you exclude a tag in Beautiful Soup?

You can call extract() on an unwanted tag to remove it from the tree before you get the text. The remaining text keeps all of its newlines and spaces, so you will need some extra work to strip them. Alternatively, you can skip every Tag object inside the outer element and keep only the NavigableString objects (the plain text in the HTML). Each extract() call removes only one tag, so if there are several unwanted tags you have to call it on each of them.
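As a rough illustration of the extract() approach, the sketch below removes the "Read more" link from an archived review before reading its text. The HTML is copied from the question; the variable names are only for illustration, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question above.
html = """<p class="review_comment">
    It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more &raquo;</a></p>"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", class_="review_comment")

# extract() detaches one tag per call, so loop over every unwanted link.
for link in p.find_all("a", class_="show-archived"):
    link.extract()

# strip=True trims the leftover newlines and indentation.
text = p.get_text(strip=True)
print(text)
```

Note that this keeps the archived review's text and only drops the link; it does not skip the paragraph itself.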

What does find_all return in Beautiful Soup?

Beautiful Soup provides the find() and find_all() functions to get specific elements from an HTML document by passing the tag name to the function. find() returns the first element with the given tag, or None if nothing matches. find_all() returns all elements with the given tag as a list-like ResultSet, which is empty if nothing matches.

What is the difference between find and find_all in Beautiful Soup?

find returns the first matching element found on the page. find_all scans the entire document and returns all the matches.
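A small sketch of the difference, using made-up sample markup (the second review is invented for the example) and assuming the "html.parser" parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with two matching paragraphs.
html = """
<p class="review_comment">This place is terrible!</p>
<p class="review_comment">Great coffee.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
first = soup.find("p", class_="review_comment")
missing = soup.find("div")

# find_all() returns a list-like ResultSet of every match (possibly empty).
all_matches = soup.find_all("p", class_="review_comment")
none_found = soup.find_all("div")

print(first.get_text())
print(len(all_matches), missing, len(none_found))
```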

How do I search for two classes in Beautiful Soup?

To find elements by class in Beautiful Soup, use the find_all() or select() method.
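One way this might look for two classes, sketched with invented markup (the "archived" class is hypothetical and not from the question). A CSS selector with chained classes requires both, while passing a list to class_ matches elements that carry any of the listed classes:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the "archived" class is invented for illustration.
html = """
<p class="review_comment">current</p>
<p class="review_comment archived">old</p>
<p class="archived">other</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Chained classes in a CSS selector: the element must have BOTH classes.
both = soup.select("p.review_comment.archived")

# A list passed to class_: the element may have ANY of the classes.
either = soup.find_all("p", class_=["review_comment", "archived"])

print([p.get_text() for p in both])
print(len(either))
```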


1 Answer

Is this what you are seeking?

for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
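For reference, here is a self-contained sketch of the same filter run against the two snippets from the question. The list name wanted and the use of "html.parser" are assumptions made for the example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# The two review snippets from the question, combined into one document.
html = """
<p class="review_comment">
    This place is terrible!</p>
<p class="review_comment">
    It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more &raquo;</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only paragraphs that do NOT contain a show-archived element.
wanted = [
    p.get_text(strip=True)
    for p in soup.find_all("p", class_="review_comment")
    if p.find(class_="show-archived") is None
]
print(wanted)
```

Comparing against None makes the intent explicit: find() returns None when the paragraph has no archived link.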

answered Oct 13 '22 18:10 by msw