
How to use lxml to find an element by text?

Tags: python, html, lxml

Assume we have the following html:

<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>

How do I make it find the element "a", which contains "TEXT A"?

So far I've got:

root = lxml.html.document_fromstring(the_html_above)
e = root.find('.//a')

I've tried:

e = root.find('.//a[@text="TEXT A"]') 

but that didn't work, as the "a" tags have no attribute "text".

Is there any way I can solve this in a similar fashion to what I've tried?

user1973386 asked Jan 13 '13 02:01



2 Answers

You are very close. Use text()= rather than @text (which indicates an attribute).

e = root.xpath('.//a[text()="TEXT A"]') 

Or, if you know only that the text contains "TEXT A",

e = root.xpath('.//a[contains(text(),"TEXT A")]') 

Or, if you know only that the text starts with "TEXT A",

e = root.xpath('.//a[starts-with(text(),"TEXT A")]') 

See the docs for more on the available string functions.
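To see how the three predicates above differ, here is a minimal sketch that runs each of them against the same markup. The sample HTML and the `hrefs` helper are made up for illustration and are not part of the question.

```python
import lxml.html

# Hypothetical sample markup, not the HTML from the question.
sample = '''\
<html>
    <body>
        <a href="/1.html">TEXT A</a>
        <a href="/2.html">TEXT A plus</a>
        <a href="/3.html">more TEXT A</a>
    </body>
</html>'''

root = lxml.html.fromstring(sample)


def hrefs(expr):
    """Return the href of every element matched by the XPath expression."""
    return [a.get('href') for a in root.xpath(expr)]


exact = hrefs('.//a[text()="TEXT A"]')                 # text equals "TEXT A"
partial = hrefs('.//a[contains(text(), "TEXT A")]')    # "TEXT A" anywhere in the text
prefix = hrefs('.//a[starts-with(text(), "TEXT A")]')  # text begins with "TEXT A"

print(exact)    # ['/1.html']
print(partial)  # ['/1.html', '/2.html', '/3.html']
print(prefix)   # ['/1.html', '/2.html']
```

Note that `text()` only looks at text nodes that are direct children of the `a` element, so links whose text sits inside a nested tag such as `<b>` would need a different expression.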


For example,

import lxml.html as LH

text = '''\
<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>'''

root = LH.fromstring(text)
e = root.xpath('.//a[text()="TEXT A"]')
print(e)

yields

[<Element a at 0xb746d2cc>] 
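Unlike `find`, which returns a single element or `None`, `xpath` returns a list, so you usually want to pull the first match out of it. A minimal sketch, assuming the HTML from the question (the variable names `matches` and `first` are only illustrative):

```python
import lxml.html as LH

text = '''\
<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>'''

root = LH.fromstring(text)

# xpath() returns a list, which is empty when nothing matches.
matches = root.xpath('.//a[text()="TEXT A"]')

# Take the first match, or None if there is none.
first = matches[0] if matches else None

if first is not None:
    print(first.text, first.get('href'))  # TEXT A /1234.html
```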
unutbu answered Sep 27 '22 00:09


Another way that looks more straightforward to me:

import lxml.html

results = []
root = lxml.html.fromstring(the_html_above)
for tag in root.iter():
    # tag.text is None for elements that have no leading text
    if tag.text and "TEXT A" in tag.text:
        results.append(tag)
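A variant of the loop above, as a sketch: passing a tag name to `iter()` restricts the walk to `a` elements, and checking `a.text` for truthiness first avoids a `TypeError` on elements whose `text` is `None`. The image-only link in the sample markup is made up to show that case.

```python
import lxml.html

# Hypothetical markup: the question's links plus one image-only link.
the_html_above = '''\
<html>
    <body>
        <a href="/0000.html"><img src="logo.png"></a>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>'''

root = lxml.html.fromstring(the_html_above)

# iter('a') visits only <a> elements; a.text is None for the image-only link,
# so test it for truthiness before the substring check.
results = [a for a in root.iter('a') if a.text and 'TEXT A' in a.text]

print([a.get('href') for a in results])  # ['/1234.html']
```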
ToonAlfrink answered Sep 26 '22 00:09