BeautifulSoup: find elements after a certain element, not necessarily siblings or children

Example HTML:

<div>
    <p>p1</p>
    <p>p2</p>
    <p>p3<span id="target">starting from here</span></p>
    <p>p4</p>
</div>
<div>
    <p>p5</p>
    <p>p6</p>
</div>
<p>p7</p>

I want to search for <p> elements, but only those positioned after span#target.

It should return p4, p5, p6 and p7 in the above example.

I tried getting all the <p>s first and then filtering, but I don't know how to tell whether an element comes after span#target or not.

asked Oct 27 '25 07:10 by fireattack


2 Answers

You can do this with the find_all_next method in BeautifulSoup.

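from bs4 import BeautifulSoup

# The example HTML from the question
doc = """<div>
    <p>p1</p>
    <p>p2</p>
    <p>p3<span id="target">starting from here</span></p>
    <p>p4</p>
</div>
<div>
    <p>p5</p>
    <p>p6</p>
</div>
<p>p7</p>"""

# Parse the HTML
soup = BeautifulSoup(doc, 'html.parser')

# Select the element to use as the reference point
span = soup.select_one("span#target")

# Find all <p> elements that come after the `span` element in the document
print(span.find_all_next("p"))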

The above snippet prints:

[<p>p4</p>, <p>p5</p>, <p>p6</p>, <p>p7</p>]
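
If you would rather fetch every <p> first and then filter, as the question describes, one possible sketch is to collect the elements returned by find_all_next and test membership by object identity. The names html_doc, after_ids and is_after_target are illustrative and not part of BeautifulSoup. id() is used on the assumption that Tag equality compares content rather than identity, so look-alike tags elsewhere in the document are not confused with each other.

```python
from bs4 import BeautifulSoup

html_doc = """<div>
    <p>p1</p>
    <p>p2</p>
    <p>p3<span id="target">starting from here</span></p>
    <p>p4</p>
</div>
<div>
    <p>p5</p>
    <p>p6</p>
</div>
<p>p7</p>"""

soup = BeautifulSoup(html_doc, "html.parser")
target = soup.find("span", id="target")

# Identities of every <p> that follows the target in document order.
after_ids = {id(p) for p in target.find_all_next("p")}


def is_after_target(tag):
    """Return True if `tag` is one of the <p> elements following the target."""
    return id(tag) in after_ids


# Fetch all paragraphs first, then keep only those after the target.
all_paragraphs = soup.find_all("p")
filtered_texts = [p.get_text() for p in all_paragraphs if is_after_target(p)]
print(filtered_texts)  # ['p4', 'p5', 'p6', 'p7']
```

This only helps if you already need the full list of <p>s for something else; otherwise calling find_all_next directly is simpler.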

Edit: In response to the OP's request below to compare positions:

If you want to compare the positions of two elements, you'll have to rely on the sourceline and sourcepos attributes, which are recorded by the html.parser and html5lib parsers (the lxml-based parsers do not provide them).

First off, store the sourceline and/or sourcepos of your reference element in a variable.

span_srcline = span.sourceline
span_srcpos = span.sourcepos

(You don't actually have to store them; you can read span.sourceline and span.sourcepos directly as long as you keep a reference to the span.)

Now iterate through the result of find_all_next and compare the values:

for tag in span.find_all_next("p"):
    print(f'line diff: {tag.sourceline - span_srcline}, pos diff: {tag.sourcepos - span_srcpos}, tag: {tag}')

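You're most likely interested in line numbers though, as sourcepos only denotes the position within a line. If two elements can share a line, compare sourceline first and then sourcepos.

However, sourceline and sourcepos mean slightly different things for each parser: html.parser reports the position of the tag's initial less-than sign, while html5lib reports the position of its final greater-than sign. Check the docs for details.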
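
The position comparison described above can be sketched as a tuple comparison. This assumes html.parser and a BeautifulSoup version that records sourceline and sourcepos (4.8.1 or later, as far as I know). source_key is a hypothetical helper name, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup

html_doc = """<div>
    <p>p1</p>
    <p>p2</p>
    <p>p3<span id="target">starting from here</span></p>
    <p>p4</p>
</div>
<div>
    <p>p5</p>
    <p>p6</p>
</div>
<p>p7</p>"""

soup = BeautifulSoup(html_doc, "html.parser")
target = soup.find("span", id="target")


def source_key(tag):
    """Return (line, column) so that tuples compare line first, then column."""
    return (tag.sourceline, tag.sourcepos)


target_key = source_key(target)

# Keep only the <p> elements whose start tag appears after the target's start tag.
after_texts = [
    p.get_text() for p in soup.find_all("p") if source_key(p) > target_key
]
print(after_texts)  # ['p4', 'p5', 'p6', 'p7']
```

Note that the <p> containing the target (p3) starts on the same line but at an earlier column, so it is correctly treated as coming before the span.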

answered Oct 29 '25 06:10 by Chase


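Try this. Note that find_next returns only the first <p> after the target; use find_all_next if you need all of them.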

html_doc = """

<div>
    <p>p1</p>
    <p>p2</p>
    <p>p3<span id="target">starting from here</span></p>
    <p>p4</p>
</div>
<div>
    <p>p5</p>
    <p>p6</p>
</div>
<p>p7</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
# find_next is the current name for the older findNext alias
print(soup.find(id="target").find_next('p').contents[0])

Result

p4
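
For comparison, here is a small sketch that puts find_next next to find_all_next, including the limit argument for capping the number of matches. The sample HTML is the one from the question, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<div>
    <p>p1</p>
    <p>p2</p>
    <p>p3<span id="target">starting from here</span></p>
    <p>p4</p>
</div>
<div>
    <p>p5</p>
    <p>p6</p>
</div>
<p>p7</p>"""

soup = BeautifulSoup(html_doc, "html.parser")
target = soup.find(id="target")

# Only the first <p> after the target.
first_text = target.find_next("p").get_text()

# Every <p> after the target, in document order.
all_texts = [p.get_text() for p in target.find_all_next("p")]

# At most two <p> elements after the target.
first_two = [p.get_text() for p in target.find_all_next("p", limit=2)]

print(first_text)  # p4
print(all_texts)   # ['p4', 'p5', 'p6', 'p7']
print(first_two)   # ['p4', 'p5']
```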
answered Oct 29 '25 07:10 by mnm