Beautiful Soup 4: Remove comment tag and its content

The page I'm scraping contains the HTML below. How do I remove the comment tag <!-- --> along with its content using bs4?

<div class="foo">
cat dog sheep goat
<!-- 
<p>NewPP limit report
Preprocessor node count: 478/300000
Post‐expand include size: 4852/2097152 bytes
Template argument size: 870/2097152 bytes
Expensive parser function count: 2/100
ExtLoops count: 6/100
</p>
-->
</div>
Flint asked Apr 25 '14 17:04

1 Answer

You can use extract() (solution is based on this answer):

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.

from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

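# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, 'html.parser')

div = soup.find('div', class_='foo')
# Find every comment node inside the div and detach it from the tree.
for element in div.find_all(string=lambda text: isinstance(text, Comment)):
    element.extract()

print(soup.prettify())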

As a result you get your div without comments:

<div class="foo">
    cat dog sheep goat
</div>
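
If you need this in more than one place, the same idea can be wrapped in a small function. The following is only a sketch: strip_comments is a hypothetical helper name (not part of bs4), and it assumes Python 3 with the built-in 'html.parser'. It also shows that extract() hands back the removed nodes, so you can still inspect the comments afterwards if you want to.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(root):
    """Remove every HTML comment under `root` and return the removed nodes.

    `strip_comments` is a hypothetical helper name, not a bs4 function.
    """
    # find_all() returns a complete list, so extracting afterwards is safe.
    comments = root.find_all(string=lambda s: isinstance(s, Comment))
    # extract() detaches each node from the tree and returns it.
    return [comment.extract() for comment in comments]


html = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

soup = BeautifulSoup(html, "html.parser")
removed = strip_comments(soup)

print(len(removed))             # how many comments were removed
print(soup.get_text().strip())  # remaining visible text
```

Passing the whole soup removes comments everywhere in the document; pass a single tag (for example the result of soup.find('div', class_='foo')) to limit it to that subtree.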
alecxe answered Sep 23 '22 04:09