Extracting Text Between HTML Comments with BeautifulSoup

Tags:

python

python-3.x

beautifulsoup

web-scraping

Using Python 3 and BeautifulSoup 4, I would like to extract text from an HTML page that is delineated only by a comment above it. An example:

<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text

I have found various ways to extract a page's text or comments, but no way to do what I'm looking for. Any help would be greatly appreciated.

asked Jan 08 '16 09:01

LANshark

3 Answers

You just need to iterate through all of the comments in the document, check whether each one is one of your required entries, and then display the text of the element that follows it:

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')

# Walk every comment node and print the text that follows the ones we want.
for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        print(comment.next_element.strip())

This would display the following:

I would like to get this text
I would also like to find this text
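One caveat worth sketching: `next_element` is simply the next thing that was parsed, so if a comment is followed directly by a tag rather than by bare text, calling `.strip()` on it would most likely fail. The variant below is a minimal sketch under a few assumptions of mine: the helper name `text_after_comment` is hypothetical, it uses the built-in `html.parser` instead of `lxml` so that no extra parser is needed, and it uses `string=`, the newer name for the `text=` argument.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""

def text_after_comment(comment):
    """Return the stripped text node right after a comment, or None."""
    node = comment.next_sibling
    # Accept only a plain text node; a Tag or another Comment means no bare text.
    if isinstance(node, NavigableString) and not isinstance(node, Comment):
        return node.strip()
    return None

soup = BeautifulSoup(html, "html.parser")
wanted = {"UNIQUE COMMENT", "SECOND UNIQUE COMMENT"}

results = {}
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    # strip() also tolerates markers written as <!-- UNIQUE COMMENT -->
    if comment.strip() in wanted:
        results[comment.strip()] = text_after_comment(comment)

print(results)
```

Returning `None` instead of raising lets the caller decide what to do with a marker that has no text after it.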

answered Oct 14 '22 13:10

Martin Evans

An improvement on Martin's answer: you can search for the specific comments directly. There is no need to iterate over all of the comments and then check their values; do it in one go:

comments_to_search_for = {'UNIQUE COMMENT', 'SECOND UNIQUE COMMENT'}
for comment in soup.find_all(text=lambda text: isinstance(text, Comment) and text in comments_to_search_for):
    print(comment.next_element.strip())

Prints:

I would like to get this text
I would also like to find this text
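If the text after a marker can contain inline tags, `next_element` only returns the first fragment. A possible extension, offered as a rough sketch rather than a definitive solution, is to walk `next_siblings` until the next comment and join everything in between. The sample HTML and the helper name `text_until_next_comment` are my own assumptions for illustration, and `html.parser` is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical sample: the first text run is interrupted by a <b> tag.
html = (
    "<div><!--UNIQUE COMMENT-->I would <b>like</b> to get this text"
    "<!--SECOND UNIQUE COMMENT-->I would also like to find this text</div>"
)

def text_until_next_comment(comment):
    """Join all text between this comment and the next sibling comment."""
    parts = []
    for node in comment.next_siblings:
        if isinstance(node, Comment):
            break  # the next marker ends this section
        if isinstance(node, NavigableString):
            parts.append(str(node))
        else:
            parts.append(node.get_text())  # text inside a nested tag
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parts).split())

soup = BeautifulSoup(html, "html.parser")
comments_to_search_for = {"UNIQUE COMMENT", "SECOND UNIQUE COMMENT"}

sections = {
    comment.strip(): text_until_next_comment(comment)
    for comment in soup.find_all(
        string=lambda s: isinstance(s, Comment) and s.strip() in comments_to_search_for
    )
}
print(sections)
```

This only looks at siblings of the comment, so it assumes each marker and its text share the same parent element.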

answered Oct 14 '22 14:10

alecxe

Python's bs4 module has a Comment class. You can use it to extract the comments.

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
comments = soup.findAll(text=lambda text:isinstance(text, Comment))

This will give you the Comment elements.

['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']

answered Oct 14 '22 13:10

salmanwahed
