I want to grab the text that comes after the Description header and before the next header.
I know that:
In [8]: soup.findAll('h2')[6]
Out[8]: <h2>Description</h2>
However, I don’t know how to grab the actual text. The problem is that I need to do this on multiple pages. Some wrap the text in p tags:
<h2>Description</h2>
<p>This is the text I want </p>
<p>This is the text I want</p>
<h2>Next header</h2>
But some don’t:
<h2>Description</h2>
This is the text I want

<h2>Next header</h2>
Also, on the pages that do use p tags, I can’t just do soup.findAll('p')[22], because on some pages that p is at index 21 or 20.
Walk the siblings that follow the header. Use isinstance with NavigableString to check whether the next sibling is a text node, or with Tag to check whether it is an element.
Break out of the loop when the next sibling is a header.
from bs4 import BeautifulSoup, NavigableString, Tag

example = """<h2>Description</h2><p>This is the text I want </p><p>This is the text I want</p><h2>Next header</h2>"""
soup = BeautifulSoup(example, 'html.parser')

for header in soup.find_all('h2'):
    next_node = header
    while True:
        next_node = next_node.next_sibling
        if next_node is None:
            # No more siblings after this header
            break
        if isinstance(next_node, NavigableString):
            # Bare text node (the case without p tags)
            print(next_node.strip())
        if isinstance(next_node, Tag):
            if next_node.name == "h2":
                # Reached the next header, so stop
                break
            print(next_node.get_text(strip=True))
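If only the Description section is needed, the same sibling walk can be wrapped in a small helper that finds the header by its text instead of by position. This is a rough sketch, not a definitive solution: the names text_after_heading, heading_text, heading_tag, with_p and without_p are made up for illustration, and it assumes the wanted text is a direct sibling of the header, as in both snippets from the question.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_after_heading(html, heading_text, heading_tag="h2"):
    """Collect the text between a heading and the next heading of the same tag.

    Hypothetical helper: assumes the wanted text is a sibling of the heading.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Find the heading by its visible text rather than by a positional index.
    heading = next(
        (h for h in soup.find_all(heading_tag)
         if h.get_text(strip=True) == heading_text),
        None,
    )
    if heading is None:
        return []

    parts = []
    for node in heading.next_siblings:
        if isinstance(node, Tag):
            if node.name == heading_tag:
                break  # reached the next section
            text = node.get_text(strip=True)
        elif isinstance(node, NavigableString):
            text = node.strip()  # bare text, the case without p tags
        else:
            continue
        if text:  # skip whitespace-only nodes
            parts.append(text)
    return parts


# The two layouts from the question
with_p = ("<h2>Description</h2><p>This is the text I want </p>"
          "<p>This is the text I want</p><h2>Next header</h2>")
without_p = "<h2>Description</h2>\nThis is the text I want\n\n<h2>Next header</h2>"

print(text_after_heading(with_p, "Description"))
print(text_after_heading(without_p, "Description"))
```

Because the header is matched by its text, it no longer matters whether it is soup.findAll('h2')[6] on one page and at a different index on another.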