
BeautifulSoup extract XPATH or CSS Path of node

I want to extract some data from HTML and then highlight the extracted elements on the client side without modifying the source HTML. An XPath or CSS path looks ideal for this. Is it possible to get the XPath or CSS path of a node directly from BeautifulSoup?
Right now I mark the target element with an attribute and then use the lxml library to extract its XPath, which is very bad for performance. I know about BSXPath.py, but it does not work with BS4. Rewriting everything to use lxml natively is not acceptable because of the complexity.

import io
import random

import bs4
from lxml import etree


def get_xpath(soup, element):
  # Tag the target with a random marker attribute so lxml can find it again.
  _id = str(random.getrandbits(32))
  for e in soup():
    # Compare by identity: bs4's == also matches other tags with equal content.
    if e is element:
      e['data-xpath'] = _id
      break
  else:
    raise LookupError('Cannot find {} in {}'.format(element, soup))
  # Serialize the whole tree and reparse it with lxml (this is the slow part).
  content = str(soup)
  doc = etree.parse(io.StringIO(content), etree.HTMLParser())
  element = doc.xpath('//*[@data-xpath="{}"]'.format(_id))
  assert len(element) == 1
  element = element[0]
  xpath = doc.getpath(element)
  return xpath

soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>', 'lxml')
xpath = get_xpath(soup, soup.div.b)
assert '/html/body/div/b' == xpath
Dmytro Sadovnychyi asked Sep 22 '14 08:09


2 Answers

It's actually pretty easy to build a simple CSS path or XPath yourself. The result is the same as what the lxml library gives you.

import bs4

def get_element(node):
  # nth-child counts element siblings only, so text nodes are skipped here.
  # For XPath, count only the siblings with the same tag name instead.
  length = len([s for s in node.previous_siblings if isinstance(s, bs4.Tag)]) + 1
  if length > 1:
    return '%s:nth-child(%s)' % (node.name, length)
  else:
    return node.name

def get_css_path(node):
  path = [get_element(node)]
  for parent in node.parents:
    if parent.name == 'body':
      break
    path.insert(0, get_element(parent))
  return ' > '.join(path)

# The 'lxml' parser wraps the fragment in <html><body>, which the loop above relies on.
soup = bs4.BeautifulSoup('<div></div><div><strong><i>bla</i></strong></div>', 'lxml')
assert get_css_path(soup.i) == 'div:nth-child(2) > strong > i'
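
For XPath, the position has to be counted among siblings with the same tag name, as the comment in get_element notes. A minimal sketch of that variant might look like the following. The names xpath_step and get_xpath_path are illustrative and not part of BeautifulSoup, and the sketch assumes the markup already contains html and body tags, because html.parser does not add them.

```python
import bs4


def xpath_step(node):
    """Return one XPath step such as 'div' or 'div[2]' for a bs4 Tag."""
    # XPath positional predicates count only siblings with the same tag name.
    before = node.find_previous_siblings(node.name)
    after = node.find_next_siblings(node.name)
    if before or after:
        return '%s[%d]' % (node.name, len(before) + 1)
    return node.name


def get_xpath_path(node):
    """Build an absolute XPath from the document root down to node."""
    steps = [xpath_step(node)]
    for parent in node.parents:
        # The BeautifulSoup object itself is the last parent; stop there.
        if isinstance(parent, bs4.BeautifulSoup):
            break
        steps.insert(0, xpath_step(parent))
    return '/' + '/'.join(steps)


html = '<html><body><div></div><div><strong><i>bla</i></strong></div></body></html>'
soup = bs4.BeautifulSoup(html, 'html.parser')
print(get_xpath_path(soup.i))  # /html/body/div[2]/strong/i
```

An index is emitted only when a tag has same-name siblings, which keeps the path short. Whether the result matches lxml's getpath exactly in every case has not been checked here.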
Dmytro Sadovnychyi answered Oct 19 '22 15:10


I'm afraid the library isn't capable of that just yet. You can sort of select elements by CSS path, but it is a bit convoluted, because you have to name each element and class. For example:

soup.find("htmlelement", class_="theclass")

You can also use ids instead of classes, or both if you want to be more specific about what you grab.

You can chain the calls to keep going down the path:

soup.find("htmlelement", class_="theclass").find("htmlelement2", class_="theclass2")

And so on.
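
As a concrete illustration of the chained lookups, here is a small sketch on made-up markup (the div, span, id and class names are invented for the example). It also shows the same lookup through select_one, which takes a CSS selector string and is available in recent BeautifulSoup 4 releases.

```python
import bs4

# Hypothetical markup, used only for this illustration.
html = '<div id="main" class="box"><span class="label">hello</span></div>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# Chained find() calls, narrowing by tag name, id and class at each step.
label = soup.find('div', id='main', class_='box').find('span', class_='label')

# The same lookup written as a single CSS selector.
same_label = soup.select_one('div#main.box > span.label')

print(label.get_text())
print(label is same_label)
```

Note that find returns None when nothing matches, so a long chain raises AttributeError at the first missing step.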

You can also navigate the tree with the built-in find_next method and the next attribute:

soup.find_next("td", class_="main").find_next("td", class_="main").next.next
CJACust answered Oct 19 '22 17:10