Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:
<p class="review_comment">
So, using the simple code as follows,

from bs4 import BeautifulSoup

content = page.read()
soup = BeautifulSoup(content, "html.parser")  # an explicit parser avoids the bs4 warning
results = soup.find_all("p", "review_comment")
I am happily parsing the text that is living here:
<p class="review_comment">
This place is terrible!</p>
The bad news is that for every 30 or so matches, soup.find_all also grabs something that I really don't want: a user's old review that they've since updated:
<p class="review_comment">
It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more »</a></p>
In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas: tweaking the soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more »</a> tag, and filtering out any match that carries the class="show-archived" attribute. Any ideas would be gratefully appreciated. Thanks in advance.
You can use extract() to remove the unwanted tag before you get the text. But extract() keeps all the '\n' characters and spaces, so you will need some extra work to remove them. Alternatively, you can skip every Tag object inside the outer element and keep only the NavigableString objects (the plain text in the HTML). extract() works, but only if you have a single unwanted tag.
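For instance, a minimal sketch of both approaches, assuming the soup built in the question:

from bs4 import NavigableString

for p in soup.find_all("p", "review_comment"):
    # extract() approach: remove the unwanted <a> tag in place, then read the text.
    link = p.find("a", class_="show-archived")
    if link is not None:
        link.extract()
    cleaned = p.get_text(strip=True)  # strip=True trims the leftover '\n' and spaces

    # NavigableString approach: keep only the plain-text children, skipping every Tag.
    plain = "".join(child for child in p.children
                    if isinstance(child, NavigableString)).strip()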
Beautiful Soup provides the find() and find_all() functions to get specific data from an HTML file by passing the tag you are looking for. find() returns the first element matching the given tag, while find_all() returns all the matching elements after scanning the entire document.
To find elements by class in Beautiful Soup, use the find_all() or select() method.
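For example, a short sketch of those calls against the soup from the question:

first = soup.find("p", class_="review_comment")      # first matching <p>, or None
every = soup.find_all("p", class_="review_comment")  # list of all matching <p> tags
same = soup.select("p.review_comment")               # CSS-selector equivalent of the find_all call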
Is this what you are seeking?
for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
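As a usage note, the same filter can be written as a list comprehension that collects only the wanted review texts (a sketch reusing the names from the question):

wanted = [p.get_text(strip=True)
          for p in soup.find_all("p", "review_comment")
          if not p.find(class_="show-archived")]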