
Extract original string position from beautifulsoup element

When parsing long, complicated HTML documents with BeautifulSoup, it's sometimes useful to get the exact position in the original string where an element was matched. I can't simply search for the string, as there may be multiple matching elements and I would lose bs4's ability to parse the DOM. Given this minimal working example:

import bs4

html = "<div><b>Hello</b>  <i>World</i></div>"
soup = bs4.BeautifulSoup(html,'lxml')

# Returns 22
print(html.find("World"))

# How to get this to return 22?
# (string= is the current name of the older text= argument)
print(soup.find("i", string="World"))

How can I get the element extracted by bs4 to return 22?

Hooked asked Jan 12 '18 16:01


1 Answer

I understand your problem is that "World" might appear many times, but you want the position of a specific occurrence (one that you, somehow, know how to identify).

You can use this workaround. I bet there are more elegant solutions, but this should do it:

Given this HTML:

import bs4

html = """<div><b>Hello</b>  <i>World</i></div>
          <div><b>Hello</b>  <i>Foo World</i></div>
          <div><b>Hello</b>  <i>Bar World</i></div>"""

soup = bs4.BeautifulSoup(html, 'html.parser')

If we want to obtain the position of the Foo World occurrence, we can:

  1. Get the tag
  2. Insert a unique string that we know is not present in the rest of the HTML
  3. Get the position of the string we added

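    import bs4
    
    html = """<div><b>Hello</b>  <i>World</i></div>
              <div><b>Hello</b>  <i>Foo World</i></div>
              <div><b>Hello</b>  <i>Bar World</i></div>"""
    
    # html.parser keeps the markup as written; lxml would wrap it in <html><body>
    soup = bs4.BeautifulSoup(html, 'html.parser')
    
    # 1. Get the tag (string= is the current name of the older text= argument)
    desired_tag = soup.find("i", string="Foo World")
    # 2. Insert the unique string at the start of the tag's contents
    desired_tag.insert(0, "some_unique_string")
    
    print(str(soup))
    """
    Will show:
    <div><b>Hello</b>  <i>World</i></div>
              <div><b>Hello</b>  <i>some_unique_stringFoo World</i></div>
              <div><b>Hello</b>  <i>Bar World</i></div>
    """
    
    # 3. The unique string sits exactly where the tag's contents start
    print(str(soup).find("some_unique_string"))
    """
    70 (with the 10 spaces of indentation on lines 2 and 3 of html;
    the same value as html.find("Foo World") on the original string)
    """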
    
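A possible alternative that leaves the tree untouched: according to the Beautiful Soup documentation, version 4.8.1 and later record `Tag.sourceline` (1-based line) and `Tag.sourcepos` (0-based column of the opening `<`) when parsing with `html.parser`; the lxml-based parsers do not support this. The sketch below assumes such a version and converts the line/column pair into an absolute offset. `tag_offset` is a hypothetical helper name, and finding the end of the start tag with `index(">")` assumes no `>` appears inside an attribute value.

```python
import bs4

html = (
    "<div><b>Hello</b>  <i>World</i></div>\n"
    "          <div><b>Hello</b>  <i>Foo World</i></div>\n"
    "          <div><b>Hello</b>  <i>Bar World</i></div>"
)


def tag_offset(html, tag):
    """Absolute offset in `html` of the '<' that opens `tag`."""
    line = getattr(tag, "sourceline", None)
    column = getattr(tag, "sourcepos", None)
    if line is None or column is None:
        raise ValueError("this parser or bs4 version did not record positions")

    # Walk forward to the start of the tag's line; html.parser counts
    # lines by "\n", so the same separator is used here.
    line_start = 0
    for _ in range(line - 1):
        line_start = html.index("\n", line_start) + 1
    return line_start + column


soup = bs4.BeautifulSoup(html, "html.parser")
tag = soup.find("i", string="Foo World")

start = tag_offset(html, tag)               # where "<i>" begins
content_start = html.index(">", start) + 1  # where "Foo World" begins
print(start, content_start)  # 67 70
```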
Pablo M answered Sep 17 '22 20:09