BeautifulSoup extract XPATH or CSS Path of node

Question

I want to extract some data from HTML and then highlight the extracted elements on the client side without modifying the source HTML. An XPath or CSS path looks like a good fit for this. Is it possible to get the XPath or CSS path of a node directly from BeautifulSoup?
Right now I mark the target element and then use lxml to extract the XPath, which is very bad for performance. I know about BSXPath.py, but it does not work with BS4. Rewriting everything to use lxml natively is not acceptable because of the complexity.

import io
import random

import bs4
from lxml import etree


def get_xpath(soup, element):
  # Tag the target with a random marker so lxml can find it after re-parsing.
  _id = str(random.getrandbits(32))
  for e in soup():
    if e == element:
      e['data-xpath'] = _id
      break
  else:
    raise LookupError('Cannot find {} in {}'.format(element, soup))
  # Serialize the whole soup and parse it again with lxml (the slow part).
  content = str(soup)
  doc = etree.parse(io.StringIO(content), etree.HTMLParser())
  element = doc.xpath('//*[@data-xpath="{}"]'.format(_id))
  assert len(element) == 1
  element = element[0]
  xpath = doc.getpath(element)
  return xpath

soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>', 'html.parser')
xpath = get_xpath(soup, soup.div.b)
assert '/html/body/div/b' == xpath

Dmytro Sadovnychyi · Accepted Answer

It's actually pretty easy to extract a simple CSS path or XPath. The result is equivalent to what the lxml lib gives you.

import bs4

def get_element(node):
  # nth-child counts element siblings only, so skip text nodes.
  # For XPath, count only preceding siblings with the same tag name instead.
  length = len([s for s in node.previous_siblings if isinstance(s, bs4.Tag)]) + 1
  if length > 1:
    return '%s:nth-child(%s)' % (node.name, length)
  else:
    return node.name

def get_css_path(node):
  path = [get_element(node)]
  for parent in node.parents:
    # Stop at <body>, or at the document root when the parser adds no <body>.
    if parent.name == 'body' or parent.parent is None:
      break
    path.insert(0, get_element(parent))
  return ' > '.join(path)

soup = bs4.BeautifulSoup('<div></div><div><strong><i>bla</i></strong></div>', 'html.parser')
assert get_css_path(soup.i) == 'div:nth-child(2) > strong > i'
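
The comment in get_element notes that XPath indexes count only siblings of the same type. A minimal sketch of that XPath variant follows. The names xpath_step and get_xpath are made up for this illustration and are not part of bs4, and the output format (an index only when same-name siblings exist) is chosen to resemble what lxml's getpath returns:

```python
import bs4


def xpath_step(node):
    # XPath positions are 1-based and count only siblings with the same tag name.
    same_before = [s for s in node.previous_siblings
                   if isinstance(s, bs4.Tag) and s.name == node.name]
    same_after = [s for s in node.next_siblings
                  if isinstance(s, bs4.Tag) and s.name == node.name]
    if not same_before and not same_after:
        return node.name  # only element of its name here: no index needed
    return '%s[%d]' % (node.name, len(same_before) + 1)


def get_xpath(node):
    steps = [xpath_step(node)]
    for parent in node.parents:
        if parent.parent is None:  # reached the BeautifulSoup document object
            break
        steps.insert(0, xpath_step(parent))
    return '/' + '/'.join(steps)


soup = bs4.BeautifulSoup(
    '<html><body><div></div><div><strong><i>bla</i></strong></div></body></html>',
    'html.parser')
print(get_xpath(soup.i))  # /html/body/div[2]/strong/i
```

Unlike the CSS version, this walks all the way to the root, so the path only starts with /html/body when the parsed tree actually contains those elements.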

CJACust · Answer

I'm afraid the library isn't capable of that just yet. You can sort of grab elements by CSS path, but it's a bit convoluted: you name each element and its class. For example:

soup.find("htmlelement", class_="theclass")

You can also use ids instead of classes, or both if you want to be more specific about what you grab.

You can chain calls to keep going down the path:

soup.find("htmlelement", class_="theclass").find("htmlelement2", class_="theclass2")

And so on.

You can also navigate with the find_next method and the next_element attribute:

soup.find_next("td", class_="main").find_next("td", class_="main").next_element.next_element
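
For comparison, here is a small sketch of the same kind of lookup written with CSS selectors. It assumes a reasonably recent bs4 release, where select() and select_one() accept CSS selectors, including positional ones such as nth-child. The HTML snippet is invented for the example:

```python
import bs4

html = '<table><tr><td class="main">a</td><td class="main">b</td></tr></table>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# Chained find() calls, one per level of the path.
by_find = soup.find('tr').find('td', class_='main')

# The same lookup written as a single CSS selector.
by_select = soup.select_one('tr > td.main')

# A positional path, like the ones get_css_path produces.
second = soup.select_one('tr > td:nth-child(2)')

print(by_find is by_select, second.get_text())  # True b
```

This only covers going from a path to a node. Going the other way, from a node to its path, still needs a helper like get_css_path.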

Tags: python, html, css, beautifulsoup, xpath
