Extracting Text Between HTML Comments with BeautifulSoup

Using Python 3 and BeautifulSoup 4, I would like to extract text from an HTML page that is delineated only by a comment above it. An example:

<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text

I have found various ways to extract a page's text or comments, but no way to do what I'm looking for. Any help would be greatly appreciated.

LANshark asked Jan 08 '16 09:01



3 Answers

Iterate over all of the comments in the document, check whether each one is among the comments you are looking for, and print the text of the element that follows it:

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')

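# Find every comment node, then keep only the ones we are interested in
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        # next_element is the text node that follows the comment
        print(comment.next_element.strip())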

This would display the following:

I would like to get this text
I would also like to find this text
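
The loop above assumes that every matching comment is followed directly by a text node. A minimal sketch of a more defensive variant is shown here, assuming the built-in html.parser (so lxml is not needed) and a hypothetical helper named text_after_comment. It skips any comment whose next node is a tag, another comment, or nothing at all:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

def text_after_comment(soup, wanted):
    """Hypothetical helper: map each wanted comment to the text node right after it."""
    found = {}
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if comment not in wanted:
            continue
        nxt = comment.next_element
        # Accept plain text only; a Tag, a Comment, or None means no text to take.
        if isinstance(nxt, NavigableString) and not isinstance(nxt, Comment):
            found[comment.strip()] = nxt.strip()
    return found

# Made-up sample markup: comment B is followed by a tag, not by text.
html = "<body><!--A-->first<!--B--><p>para</p><!--C-->third</body>"
soup = BeautifulSoup(html, "html.parser")
result = text_after_comment(soup, {"A", "B", "C"})
print(result)
```

Comment B is left out of the result because the node after it is a <p> tag rather than a string.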
Martin Evans answered Oct 14 '22 13:10


An improvement on Martin's answer: you can search for the specific comments directly. There is no need to iterate over all of the comments and then check their values; do it in one go:

comments_to_search_for = {'UNIQUE COMMENT', 'SECOND UNIQUE COMMENT'}
for comment in soup.find_all(string=lambda text: isinstance(text, Comment) and text in comments_to_search_for):
    print(comment.next_element.strip())

Prints:

I would like to get this text
I would also like to find this text
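
One thing to watch for: the membership test compares the comment body exactly, so a comment written as <!-- UNIQUE COMMENT --> (with padding spaces) would not match. A small sketch, assuming html.parser and a hypothetical filter function named is_wanted, that strips whitespace before comparing:

```python
from bs4 import BeautifulSoup, Comment

wanted = {"UNIQUE COMMENT", "SECOND UNIQUE COMMENT"}

def is_wanted(node):
    # Hypothetical filter: a Comment whose body, ignoring padding, is in the set.
    return isinstance(node, Comment) and node.strip() in wanted

# Made-up sample markup: the first comment has padding spaces around its body.
html = ("<body><!-- UNIQUE COMMENT -->one"
        "<!--OTHER-->two"
        "<!--SECOND UNIQUE COMMENT-->three</body>")
soup = BeautifulSoup(html, "html.parser")
texts = [c.next_element.strip() for c in soup.find_all(string=is_wanted)]
print(texts)
```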
alecxe answered Oct 14 '22 14:10


The bs4 module has a Comment class. You can use it to extract the comments.

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print(comments)

This will give you the Comment elements:

['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']
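
Once the comments are in hand, the text between one comment and the next can be gathered by walking next_siblings. The following is only a sketch under a few assumptions: it uses html.parser, the helper name text_until_next_comment is hypothetical, and the comments are assumed to be siblings under the same parent.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

def text_until_next_comment(comment):
    """Hypothetical helper: join sibling text after `comment` up to the next comment."""
    parts = []
    for sibling in comment.next_siblings:
        if isinstance(sibling, Comment):
            break  # the next comment ends this section
        if isinstance(sibling, NavigableString):
            parts.append(sibling)
        else:
            parts.append(sibling.get_text())  # a Tag: take its nested text
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parts).split())

# Made-up sample markup with an inline tag between the two comments.
html = "<body><!--FIRST-->some <b>bold</b> text<!--SECOND-->more text</body>"
soup = BeautifulSoup(html, "html.parser")
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
sections = {c.strip(): text_until_next_comment(c) for c in comments}
print(sections)
```

Unlike next_element, this also picks up text inside inline tags such as <b> that sit between two comments.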
salmanwahed answered Oct 14 '22 13:10