When parsing long, complicated HTML documents with BeautifulSoup, it's sometimes useful to get the exact position in the original string where I've matched an element. I can't simply search for the string, as there may be multiple matching elements and I would lose bs4's ability to parse the DOM. Given this minimal working example:
import bs4
html = "<div><b>Hello</b> <i>World</i></div>"
soup = bs4.BeautifulSoup(html, 'lxml')
# Prints 21
print(html.find("World"))
# How to get this to return 21?
print(soup.find("i", text="World"))
How can I get the element extracted by bs4 to return 21?
I understand your problem is that "World" might appear many times, but you want the position of one specific occurrence (one that you, somehow, know how to identify).
You can use this workaround. I bet there are more elegant solutions, but this should do it:
Given this HTML:
import bs4
html = """<div><b>Hello</b> <i>World</i></div>
<div><b>Hello</b> <i>Foo World</i></div>
<div><b>Hello</b> <i>Bar World</i></div>"""
soup = bs4.BeautifulSoup(html, 'html.parser')
If we want to obtain the position of the Foo World occurrence, we can:
1. Find the tag we are interested in
2. Insert a unique string at the start of that tag
3. Get the position of the string we added
import bs4
html = """<div><b>Hello</b> <i>World</i></div>
<div><b>Hello</b> <i>Foo World</i></div>
<div><b>Hello</b> <i>Bar World</i></div>"""
soup = bs4.BeautifulSoup(html,'html.parser')
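# 1. Find the tag we are interested in
desired_tag = soup.find("i", text="Foo World")
# 2. Insert a marker string at the start of the tag's contents
desired_tag.insert(0, "some_unique_string")
print(str(soup))
"""
Will show:
<div><b>Hello</b> <i>World</i></div>
<div><b>Hello</b> <i>some_unique_stringFoo World</i></div>
<div><b>Hello</b> <i>Bar World</i></div>
"""
# 3. The marker's offset in the serialized soup is the position we want.
# It matches the offset in the original html only because html.parser
# writes this markup back unchanged; lxml would wrap it in <html><body>.
print(str(soup).find("some_unique_string"))
"""
58
"""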
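If you would rather not modify the tree, another option is to ask the parser where it saw each tag. The sketch below is not bs4 code: it uses the standard library's html.parser.HTMLParser directly and converts the (line, column) pair from getpos() into an absolute offset. The class name StartTagOffsets and its attributes found and line_starts are made up for this example. As far as I know, Beautiful Soup 4.8.1 and later expose similar information as Tag.sourceline and Tag.sourcepos when the html.parser builder is used, so the same line-to-offset conversion should apply there.

```python
from html.parser import HTMLParser


class StartTagOffsets(HTMLParser):
    """Collect (tag_name, absolute_offset) for every start tag in `source`."""

    def __init__(self, source):
        super().__init__()
        # Index at which each line begins, used to turn the parser's
        # (line, column) position into an offset into the whole string.
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)
        self.found = []
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        # getpos() reports a 1-based line number and a 0-based column,
        # pointing at the "<" that opens the current start tag.
        line, column = self.getpos()
        self.found.append((tag, self.line_starts[line - 1] + column))


html = """<div><b>Hello</b> <i>World</i></div>
<div><b>Hello</b> <i>Foo World</i></div>
<div><b>Hello</b> <i>Bar World</i></div>"""

i_tags = [pos for tag, pos in StartTagOffsets(html).found if tag == "i"]
tag_start = i_tags[1]  # second <i>, the "Foo World" one
# Assumes no ">" inside attribute values of the start tag.
text_start = html.index(">", tag_start) + 1
print(tag_start, text_start)            # 55 58
print(html[text_start:text_start + 9])  # Foo World
```

Here the occurrence is picked by index; in practice you would pick it with whatever rule identifies the element you care about. The result 58 agrees with the marker approach above.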