Assume we have the following html: <pre class="prettyprint"><code><html> <body> <a href="/1234.html">TEXT A</a> <a href="/3243.html">TEXT B</a> <a href="/7445.html">TEXT C</a> <body> </html> </code></pre> How do I make it find the element "a", which contains "TEXT A"? So far I've got: <pre class="prettyprint lang-py prettyprint-override"><code>root = lxml.html.document_fromstring(the_html_above) e = root.find('.//a') </code></pre> I've tried: <pre class="prettyprint"><code>e = root.find('.//a[@text="TEXT A"]') </code></pre> but that didn't work, as the "a" tags have no attribute "text". Is there any way I can solve this in a similar fashion to what I've tried?

You are very close. Use <code>text()=</code> rather than <code>@text</code> (which indicates an attribute). <pre class="prettyprint"><code>e = root.xpath('.//a[text()="TEXT A"]') </code></pre> Or, if you know only that the text contains "TEXT A", <pre class="prettyprint"><code>e = root.xpath('.//a[contains(text(),"TEXT A")]') </code></pre> Or, if you know only that text starts with "TEXT A", <pre class="prettyprint"><code>e = root.xpath('.//a[starts-with(text(),"TEXT A")]') </code></pre> See the docs for more on the available string functions. <hr> For example, <pre class="prettyprint"><code>import lxml.html as LH text = '''\ <html> <body> <a href="/1234.html">TEXT A</a> <a href="/3243.html">TEXT B</a> <a href="/7445.html">TEXT C</a> <body> </html>''' root = LH.fromstring(text) e = root.xpath('.//a[text()="TEXT A"]') print(e) </code></pre> yields <pre class="prettyprint"><code>[<Element a at 0xb746d2cc>] </code></pre>

Another way that looks more straightforward to me: <pre class="prettyprint"><code>results = [] root = lxml.hmtl.fromstring(the_html_above) for tag in root.iter(): if "TEXT A" in tag.text results.append(tag) </code></pre>

How to use lxml to find an element by text?

<html>     <body>         <a href="/1234.html">TEXT A</a>         <a href="/3243.html">TEXT B</a>         <a href="/7445.html">TEXT C</a>     <body> </html>

How do I make it find the element "a", which contains "TEXT A"?

So far I've got:

root = lxml.html.document_fromstring(the_html_above) e = root.find('.//a')

I've tried:

e = root.find('.//a[@text="TEXT A"]')

but that didn't work, as the "a" tags have no attribute "text".

Is there any way I can solve this in a similar fashion to what I've tried?

520

asked Jan 13 '13 02:01

user1973386

2 Answers

You are very close. Use text()= rather than @text (which indicates an attribute).

e = root.xpath('.//a[text()="TEXT A"]')

Or, if you know only that the text contains "TEXT A",

e = root.xpath('.//a[contains(text(),"TEXT A")]')

Or, if you know only that text starts with "TEXT A",

e = root.xpath('.//a[starts-with(text(),"TEXT A")]')

See the docs for more on the available string functions.

For example,

import lxml.html as LH  text = '''\ <html>     <body>         <a href="/1234.html">TEXT A</a>         <a href="/3243.html">TEXT B</a>         <a href="/7445.html">TEXT C</a>     <body> </html>'''  root = LH.fromstring(text) e = root.xpath('.//a[text()="TEXT A"]') print(e)

yields

[<Element a at 0xb746d2cc>]

answered Sep 27 '22 00:09

unutbu

Another way that looks more straightforward to me:

results = [] root = lxml.hmtl.fromstring(the_html_above) for tag in root.iter():     if "TEXT A" in tag.text         results.append(tag)

answered Sep 26 '22 00:09

ToonAlfrink

Related questions
                            
                                Resizing and stretching a NumPy array
                            
                                Calling app from subprocess.call with arguments
                            
                                How to profile cython functions line-by-line
                            
                                How to import OpenSSL in python
                            
                                Get all HTML tags with Beautiful Soup
                            
                                pandas DataFrame diagonal
                            
                                python: lambda, yield-statement/expression and loops (Clarify)
                            
                                Inheriting from a namedtuple base class
                            
                                Groupby class and count missing values in features
                            
                                Could not find a version that satisfies the requirement torch>=1.0.0?
                            
                                Is there any python library for parsing dates and times from a natural language? [closed]
                            
                                changing order of unit tests in Python
                            
                                How can I spawn new shells to run Python scripts from a base Python script?
                            
                                Iterate through Python dictionary by Keys in sorted order [duplicate]
                            
                                Keyboard interrupt in debug mode PyCharm
                            
                                Get inserted key before commit session
                            
                                Multiple output and numba signatures
                            
                                return value from python script to shell script
                            
                                How do you solve the error KeyError: 'A secret key is required to use CSRF.' when using a wtform in flask application?
                            
                                How to override constructor of Python class with many arguments?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to use lxml to find an element by text?

Tags:

python

html

lxml

user1973386

People also ask

2 Answers

unutbu

ToonAlfrink

Recent Activity

Donate For Us