
How to find all comments with Beautiful Soup

This question was asked four years ago, but the answer is now out of date for BS4.

I want to delete all comments in my HTML file using Beautiful Soup. Since BS4 represents each comment as a special type of NavigableString, I thought this code would work:

for comments in soup.find_all('comment'):
     comments.decompose()

That didn't work: the loop never finds any comments. How do I find all comments using BS4?

Joseph asked Oct 15 '15 02:10



2 Answers

You can pass a function to find_all() to help it check whether the string is a Comment.

For example, given the HTML below:

<body>
   <!-- Branding and main navigation -->
   <div class="Branding">The Science &amp; Safety Behind Your Favorite Products</div>
   <div class="l-branding">
      <p>Just a brand</p>
   </div>
   <!-- test comment here -->
   <div class="block_content">
      <a href="https://www.google.com">Google</a>
   </div>
</body>

Code:

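from bs4 import BeautifulSoup as BS
from bs4 import Comment

# html is assumed to hold the markup shown above
soup = BS(html, 'html.parser')
# Keep only the text nodes that are Comment instances
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    print(c)
    print("===========")
    c.extract()  # remove the comment from the tree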

The output would be:

Branding and main navigation 
===========
test comment here
===========

As for why find_all('comment') doesn't work, I think the reason is this passage from the Beautiful Soup documentation:

Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names that don’t match.
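
To make the quoted rule concrete, here is a minimal sketch, assuming bs4 is installed. The HTML snippet is made up for illustration and deliberately includes a tag that happens to be named comment, to show what a name filter actually matches.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup, used only for illustration.
html = "<div><!-- a note --><p>text</p><comment>a real tag</comment></div>"
soup = BeautifulSoup(html, "html.parser")

# A name filter matches tags called <comment>, never HTML comments.
by_name = soup.find_all("comment")

# A string filter is applied to text nodes, so it can check each node's type.
by_type = soup.find_all(string=lambda s: isinstance(s, Comment))

print(by_name)
print(by_type)
```

The first call only ever sees Tag objects, which is why the loop in the question never ran; the second call inspects strings, and Comment is a subclass of NavigableString.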

Flickerlight answered Oct 13 '22 01:10


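There were two things I needed to do.

First, import Comment along with BeautifulSoup:

from bs4 import BeautifulSoup, Comment

Second, find every string that is a Comment and extract it. In BS4 the method is spelled find_all() and the argument is string; the older findAll(text=...) spelling is a legacy alias:

for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()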
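
For reuse, the two steps can be wrapped in a small function. This is only a sketch: strip_comments is a hypothetical helper name (not part of bs4), and it assumes the built-in html.parser is an acceptable parser for the input.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html: str) -> str:
    """Return the markup with every HTML comment removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so removing nodes while looping is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


# Made-up input for illustration.
cleaned = strip_comments("<p>keep<!-- drop --></p><!-- drop too --><b>ok</b>")
print(cleaned)
```

extract() detaches each comment from the tree and returns it, so the same loop could collect the comment text instead of discarding it.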

Joseph answered Oct 13 '22 02:10