Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

XPath for sub-element's text value in lxml

First of all, is it possible to do such thing?

I have been trying out to generate Xpath expression by using "sub-element text values" present in webpage. Trying to do this using lxml (etree, html, getpath), ElementTree modules in Python. But I don't know how to generate Xpath expression for a value present in the webpage. I totally know about Scrapy framework in python, but this is different.

Below is the my incomplete code..

import urllib2, re
from lxml import etree

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''
    return outtxt


newUrl = 'http://www.iupui.edu/~webtrain/tutorials/tables.html' # homepage

dt = wgetUrl(newUrl)
parser = etree.HTMLParser()
tree   = etree.fromstring(dt, parser)

As per the lxml documentation they are creating element tree manually, but how can I use my read and parsed html data (in my example variable tree or data) to access the sub-element. Or more importantly, if possible the sub-element text value.

Let's say in the above example webpage, I want to search table "Supplies and Expenses" and generate Xpath expression dynamically by that value - Supplies and Expenses

Is there any option to do so !!! Ultimate goal, I would like to achieve is read webpage and generate Xpath for the sub-element text value present in webpage.

like image 221
techDiscussion Avatar asked Apr 05 '26 19:04

techDiscussion


1 Answers

To find all elements based on a part of their text value:

"//*[contains(text(), 'some_value')]"

For example, if you have this:

<div id="somediv">
    <span>Something is here</span>
    <a href="#">Click here</a>
</div>

You can find all sub-elements containing the word "here" like this:

"//div[@id='somediv']//*[contains(text(), 'here')]"

Or you can for example find all sub-div span elements containing the word "Something":

"//div[@id='somediv']//span[contains(text(), 'Something')]"

As for parsing this in lxml:

from lxml import etree
outtxt = response.read()
root = etree.fromstring(outtxt)
root.xpath("my_xpath_expression")

Update:

To get the full XPath expression for an element, use the ElementTree.getPath() method, like so:

tree = etree.ElementTree(root)
# this will print XPath of all
# elements in 'root'
for e in root.iter():
    print tree.getpath(e)
like image 124
bosnjak Avatar answered Apr 08 '26 08:04

bosnjak



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!