I want to parse HTML with lxml using XPath expressions. My problem is matching on the contents of a tag.
For example, given the
<a href="http://something">Example</a>
element I can match the href attribute using
.//a[@href='http://something']
but given the expression
.//a[.='Example']
or even
.//a[contains(.,'Example')]
lxml throws the 'invalid node predicate' exception.
What am I doing wrong?
EDIT:
Example code:
from io import StringIO
from lxml import etree
html = '<a href="http://something">Example</a>'
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
# find() only understands the simple ElementPath syntax, so this line raises the error
print(tree.find(".//a[text()='Example']").tag)
Expected output is 'a'. Instead I get 'SyntaxError: invalid node predicate'.
If all you have in your section of code is the element and you want the element's XPath, then element.getroottree().getpath(element) will do the job.
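A minimal sketch of that getpath() call, assuming Python 3 and a small made-up XML document (the tag names root, item and name are only illustrative):

```python
from lxml import etree

# Hypothetical document used only for illustration.
root = etree.fromstring(
    "<root><item><name>first</name></item><item><name>second</name></item></root>"
)

# Pick the <name> inside the second <item>.
element = root.findall("item")[1].find("name")

# getpath() lives on the ElementTree, so go through getroottree().
path = element.getroottree().getpath(element)
print(path)

# The generated path can be fed back into xpath() to locate the same node.
found = root.xpath(path)[0]
print(found.text)
```

The path is positional, so it identifies this particular node in this particular tree rather than describing it by content.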
XPath return values: a float, when the XPath expression has a numeric result (integer or float); a 'smart' string, when the XPath expression has a string result; a list of items, when the XPath expression has a list as result.
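The return types can be checked with a short sketch, assuming Python 3 and a made-up two-element document. The boolean case at the end is not in the list above but is included for completeness:

```python
from lxml import etree

# Hypothetical document used only for illustration.
root = etree.fromstring("<root><a>one</a><a>two</a></root>")

count = root.xpath("count(//a)")       # numeric result: a float
text = root.xpath("string(//a[1])")    # string result: a str subclass
nodes = root.xpath("//a")              # node-set result: a list of elements
flag = root.xpath("count(//a) = 2")    # boolean result: a bool

print(type(count).__name__, count)
print(isinstance(text, str), text)
print([node.text for node in nodes])
print(flag)
```

Because a node-set always comes back as a list, a single match still needs indexing, as in xpath(...)[0].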
I would try with:
.//a[text()='Example']
using the xpath() method:
tree.xpath(".//a[text()='Example']")[0].tag
In case you would like to use iterfind(), findall(), find() or findtext(), keep in mind that advanced features like value comparison and functions are not available in ElementPath.
lxml.etree supports the simple path syntax of the find, findall and findtext methods on ElementTree and Element, as known from the original ElementTree library (ElementPath). As an lxml specific extension, these classes also provide an xpath() method that supports expressions in the complete XPath syntax, as well as custom extension functions.
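A minimal sketch of the difference, assuming Python 3 and a recent lxml; the variable names are only illustrative. It reuses the HTML from the question, evaluates the text-matching expressions through xpath(), and keeps find() for the attribute test that the simple path syntax supports:

```python
from io import StringIO
from lxml import etree

html = '<a href="http://something">Example</a>'
tree = etree.parse(StringIO(html), etree.HTMLParser())

# Full XPath: text comparison and functions are available via xpath().
exact = tree.xpath(".//a[text()='Example']")
by_string_value = tree.xpath(".//a[.='Example']")
partial = tree.xpath(".//a[contains(., 'Exam')]")

# Simple path syntax: an attribute comparison is fine with find().
by_attr = tree.find(".//a[@href='http://something']")

# xpath() returns a list for node-set results, so index into it.
print(exact[0].tag)
print(by_string_value[0].get("href"))
print(len(partial))
print(by_attr.text)
```

Since xpath() returns an empty list when nothing matches, it is worth checking the result before indexing in real code.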