I want to parse HTML with lxml using XPath expressions. My problem is matching for the contents of a tag: For example given the <pre class="prettyprint"><code><a href="http://something">Example</a> </code></pre> element I can match the href attribute using <pre class="prettyprint"><code>.//a[@href='http://something'] </code></pre> but the given the expression <pre class="prettyprint"><code>.//a[.='Example'] </code></pre> or even <pre class="prettyprint"><code>.//a[contains(.,'Example')] </code></pre> lxml throws the 'invalid node predicate' exception. What am I doing wrong? EDIT: Example code: <pre class="prettyprint"><code>from lxml import etree from cStringIO import StringIO html = '<a href="http://something">Example</a>' parser = etree.HTMLParser() tree = etree.parse(StringIO(html), parser) print tree.find(".//a[text()='Example']").tag </code></pre> Expected output is 'a'. I get 'SyntaxError: invalid node predicate'

I would try with: <code>.//a[text()='Example']</code> using xpath() method: <pre class="prettyprint"><code>tree.xpath(".//a[text()='Example']")[0].tag </code></pre> If case you would like to use iterfind(), findall(), find(), findtext(), keep in mind that advanced features like value comparison and functions are not available in ElementPath. <blockquote> lxml.etree supports the simple path syntax of the find, findall and findtext methods on ElementTree and Element, as known from the original ElementTree library (ElementPath). As an lxml specific extension, these classes also provide an xpath() method that supports expressions in the complete XPath syntax, as well as custom extension functions. </blockquote>

How do I match contents of an element in XPath (lxml)?

Tags:

python

xpath

lxml

predicate

I want to parse HTML with lxml using XPath expressions. My problem is matching for the contents of a tag:

For example given the

<a href="http://something">Example</a>

element I can match the href attribute using

.//a[@href='http://something']

but the given the expression

.//a[.='Example']

or even

.//a[contains(.,'Example')]

lxml throws the 'invalid node predicate' exception.

What am I doing wrong?

EDIT:

Example code:

from lxml import etree
from cStringIO import StringIO

html = '<a href="http://something">Example</a>'
parser = etree.HTMLParser()
tree   = etree.parse(StringIO(html), parser)

print tree.find(".//a[text()='Example']").tag

Expected output is 'a'. I get 'SyntaxError: invalid node predicate'

552

asked Apr 14 '10 13:04

akosch

1 Answers

I would try with:

.//a[text()='Example']

using xpath() method:

tree.xpath(".//a[text()='Example']")[0].tag

If case you would like to use iterfind(), findall(), find(), findtext(), keep in mind that advanced features like value comparison and functions are not available in ElementPath.

lxml.etree supports the simple path syntax of the find, findall and findtext methods on ElementTree and Element, as known from the original ElementTree library (ElementPath). As an lxml specific extension, these classes also provide an xpath() method that supports expressions in the complete XPath syntax, as well as custom extension functions.

126

answered Oct 07 '22 01:10

systempuntoout

Related questions
                            
                                In Python, how is the in operator implemented to work? Does it use the next() method of the iterators?
                            
                                AttributeError when using ColumnTransformer into a pipeline
                            
                                Django can' t load Module 'debug_toolbar': No module named 'debug_toolbar'
                            
                                pymysql.err.InterfaceError: (0, '') error when doing a lot of pushes to sql table
                            
                                How to properly structure internal scripts in a Python project?
                            
                                Most pythonic callable generating True?
                            
                                Getting the literal out of a python Literal type, at runtime?
                            
                                Using YOLO or other image recognition techniques to identify all alphanumeric text present in images
                            
                                JSON to Protobuf in Python
                            
                                How to format requirements.txt when package source is from specific websites?
                            
                                Airflow - got an unexpected keyword argument 'conf'
                            
                                Gensim 3.8.0 to Gensim 4.0.0
                            
                                Zappa deploy fails with AttributeError: 'Template' object has no attribute 'add_description'
                            
                                Python: How to detect debug interpreter
                            
                                Merge sorted lists in python
                            
                                Django ORM for desktop application
                            
                                Can't open Unicode URL with Python
                            
                                In python, what does len(list) do?
                            
                                Django: Paginator + raw SQL query
                            
                                Getting last insert id with SQLAlchemy

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With