I want to extract some data from HTML and then highlight the extracted elements on the client side without modifying the source HTML. An XPath or CSS path looks ideal for this. Is it possible to get an XPath or CSS path for an element directly from BeautifulSoup?
Right now I mark the target element and then use the lxml library to extract the XPath, which is very bad for performance. I know about BSXPath.py, but it does not work with BS4.
Rewriting everything to use lxml natively is not acceptable due to the complexity.
import io
import random

import bs4
from lxml import etree


def get_xpath(soup, element):
    # Tag the target with a random marker so lxml can find it after re-parsing.
    _id = str(random.getrandbits(32))
    for e in soup():
        if e is element:
            e['data-xpath'] = _id
            break
    else:
        raise LookupError('Cannot find {} in {}'.format(element, soup))
    # Serialize the whole soup and parse it again with lxml (the slow part).
    content = str(soup)
    doc = etree.parse(io.StringIO(content), etree.HTMLParser())
    element = doc.xpath('//*[@data-xpath="{}"]'.format(_id))
    assert len(element) == 1
    element = element[0]
    xpath = doc.getpath(element)
    return xpath


soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>', 'lxml')
xpath = get_xpath(soup, soup.div.b)
assert '/html/body/div/b' == xpath
It's actually pretty easy to extract a simple CSS path or XPath yourself. The result is the same kind of path the lxml library gives you.
import bs4


def get_element(node):
    # Count element siblings only (text nodes are skipped).
    # For XPath we have to count only nodes of the same type!
    length = sum(1 for s in node.previous_siblings if isinstance(s, bs4.Tag)) + 1
    if length > 1:
        return '%s:nth-child(%s)' % (node.name, length)
    else:
        return node.name


def get_css_path(node):
    path = [get_element(node)]
    for parent in node.parents:
        # Stop at <body>; the 'lxml' parser below always adds one.
        if parent.name == 'body':
            break
        path.insert(0, get_element(parent))
    return ' > '.join(path)


soup = bs4.BeautifulSoup('<div></div><div><strong><i>bla</i></strong></div>', 'lxml')
assert get_css_path(soup.i) == 'div:nth-child(2) > strong > i'
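The comment in get_element points out that XPath positions count only siblings with the same tag name. A minimal sketch of that variant follows. The helper names xpath_step and get_xpath_path are made up for illustration and are not part of BeautifulSoup, and the sketch assumes the built-in 'html.parser' backend so that it does not depend on lxml.

```python
import bs4


def xpath_step(node):
    """Return one XPath step such as 'div[2]' for a bs4 Tag."""
    # XPath positional predicates count only siblings with the same tag name.
    same_before = sum(
        1 for sib in node.previous_siblings
        if isinstance(sib, bs4.Tag) and sib.name == node.name
    )
    same_after = any(
        isinstance(sib, bs4.Tag) and sib.name == node.name
        for sib in node.next_siblings
    )
    # Add an index only when the tag name is ambiguous among its siblings.
    if same_before or same_after:
        return '%s[%d]' % (node.name, same_before + 1)
    return node.name


def get_xpath_path(node):
    """Build an absolute XPath from the document root down to node."""
    steps = [xpath_step(node)]
    for parent in node.parents:
        # The BeautifulSoup object is the document itself, not an element.
        if isinstance(parent, bs4.BeautifulSoup):
            break
        steps.insert(0, xpath_step(parent))
    return '/' + '/'.join(steps)


html = ('<html><body><p>x</p><div></div>'
        '<div><strong><i>bla</i></strong></div></body></html>')
soup = bs4.BeautifulSoup(html, 'html.parser')
print(get_xpath_path(soup.i))
```

The path describes the tree as BeautifulSoup parsed it. A browser may build a slightly different DOM from the same markup (for example by inserting tbody into tables), so it is worth checking the result on the client before relying on it.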
I'm afraid the library isn't capable of that just yet. You can sort of grab elements by CSS path, but it's a bit convoluted, because you have to name each element and class. For example:
soup.find("htmlelement", class_="theclass")
You can also use ids instead of classes, or both if you want to be more specific about what you grab.
You can extend it to keep going down the path:
soup.find("htmlelement", class_="theclass").find("htmlelement2", class_="theclass2")
And so on.
There are also ways to navigate the tree by calling the built-in "next" functions:
find_next("td", class_="main").find_next("td", class_="main").next.next
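To make the chained lookups concrete, here is a small sketch on made-up markup (the table and its "main" class are assumptions for illustration only). It also shows the same element reached through BeautifulSoup's select() method, which accepts a CSS selector; how much CSS syntax is supported depends on the installed BeautifulSoup version.

```python
import bs4

html = (
    '<table><tr>'
    '<td class="main">first</td>'
    '<td class="main">second <b>bold</b></td>'
    '</tr></table>'
)
soup = bs4.BeautifulSoup(html, 'html.parser')

# Chained lookups: find the first match, then move on to the next match.
first = soup.find('td', class_='main')
second = first.find_next('td', class_='main')

# The same element addressed with a single CSS selector.
via_css = soup.select('tr > td.main')[1]

print(first.get_text())
print(second.get_text())
print(second is via_css)
```

Both routes return the same Tag object, so the choice is mostly about which one reads better for a given path.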