Get text in between two h2 headers using BeautifulSoup

I want to grab the text that comes after the Description header and before the next header.

I know that:

In [8]: soup.findAll('h2')[6]
Out[8]: <h2>Description</h2>

However, I don't know how to grab the actual text. The problem is that I have to do this on multiple pages, and their markup differs. On some, the text is wrapped in p tags:

<h2>Description</h2>
<p>This is the text I want </p>
<p>This is the text I want</p>
<h2>Next header</h2>

But on others it is not:

<h2>Description</h2>
This is the text I want
<h2>Next header</h2>

Also, on the pages that use p tags, I can't just do soup.findAll('p')[22], because on some pages that p is at index 21 or 20.

user6754289 asked Mar 15 '26 21:03


1 Answer

Use isinstance to check whether the next sibling is a NavigableString (a text node) or a Tag (an element).

Break out of the loop when the next sibling is a header.

from bs4 import BeautifulSoup, NavigableString, Tag

example = """<h2>Description</h2><p>This is the text I want </p><p>This is the text I want</p><h2>Next header</h2>"""

soup = BeautifulSoup(example, 'html.parser')
for header in soup.find_all('h2'):
    next_node = header
    while True:
        next_node = next_node.next_sibling
        if next_node is None:
            break  # no more siblings after this header
        if isinstance(next_node, NavigableString):
            # Text node: print it unless it is only whitespace.
            text = next_node.strip()
            if text:
                print(text)
        elif isinstance(next_node, Tag):
            if next_node.name == "h2":
                break  # reached the next header
            print(next_node.get_text(strip=True))
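
The loop above visits every h2 on the page. If only the Description section is needed, the same sibling walk can be wrapped in a small helper that finds the header by its text instead of by index, which also avoids the shifting-index problem from the question. This is a rough sketch, not a tested solution for the real pages: the function name text_after_header and its parameters are made up for illustration, and it assumes the header and the text you want are siblings under the same parent.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_after_header(html, title, header_tag="h2"):
    """Return the text between the header named `title` and the next header.

    Hypothetical helper: returns None if no matching header is found.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Locate the header by its visible text rather than by position.
    header = soup.find(
        lambda tag: tag.name == header_tag and tag.get_text(strip=True) == title
    )
    if header is None:
        return None

    parts = []
    for node in header.next_siblings:
        if isinstance(node, Tag):
            if node.name == header_tag:
                break  # reached the next header, stop collecting
            text = node.get_text(strip=True)
        elif isinstance(node, NavigableString):
            text = node.strip()
        else:
            continue
        if text:  # skip whitespace-only nodes
            parts.append(text)
    return "\n".join(parts)


with_p = (
    "<h2>Description</h2>\n<p>This is the text I want </p>\n"
    "<p>This is the text I want</p>\n<h2>Next header</h2>"
)
without_p = "<h2>Description</h2>\n  This is the text I want  \n<h2>Next header</h2>"

print(text_after_header(with_p, "Description"))
print(text_after_header(without_p, "Description"))
```

Both layouts from the question go through the same code path, since text nodes and p elements are handled in the same loop. If the header text on the real pages has different casing or extra words, the equality check in the lambda would need loosening.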

Zroq answered Mar 17 '26 11:03


