The page that I'm scraping contains the HTML below. How do I remove the comment tag <!-- --> along with its content using bs4?
<div class="foo">
cat dog sheep goat
<!--
<p>NewPP limit report
Preprocessor node count: 478/300000
Post‐expand include size: 4852/2097152 bytes
Template argument size: 870/2097152 bytes
Expensive parser function count: 2/100
ExtLoops count: 6/100
</p>
-->
</div>
Pass the HTML document to the BeautifulSoup() constructor. Use the 'p' tag to extract paragraphs from the BeautifulSoup object, and get the text from the HTML document with get_text().
You can use extract() (solution is based on this answer). PageElement.extract() removes a tag or string from the tree and returns the tag or string that was extracted:
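from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
div = soup.find('div', class_='foo')

# Comments are string nodes of type Comment; find them and detach each one
for element in div.find_all(string=lambda text: isinstance(text, Comment)):
    element.extract()

print(soup.prettify())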
As a result you get your div without comments:
<div class="foo">
cat dog sheep goat
</div>
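If the comments are not confined to a single div, the same idea can be applied to the whole document. The following is only a sketch: the helper name strip_comments and the sample HTML are made up for illustration, and it assumes the built-in "html.parser" backend. It also collects what extract() returns, to show that each removed comment is handed back to the caller.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Hypothetical helper: return (cleaned_html, removed_comments)."""
    soup = BeautifulSoup(html, "html.parser")
    removed = []
    # find_all(string=...) calls the function with each string node in the tree
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # extract() detaches the node from the tree and returns it
        removed.append(str(comment.extract()))
    return str(soup), removed


# Illustrative input, not taken from the scraped page
html = '<div class="foo">cat dog<!-- hidden --></div><p>x<!--y--></p>'
cleaned, removed = strip_comments(html)
print(cleaned)
print(removed)
```

Because find_all() returns a list of matches up front, removing nodes inside the loop does not disturb the iteration.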