First of all, is it possible to do such a thing?
I have been trying to generate an XPath expression from the "sub-element text values" present in a webpage. I'm trying to do this with lxml (etree, html, getpath) and the ElementTree modules in Python, but I don't know how to generate an XPath expression for a value present in the webpage. I know about the Scrapy framework in Python, but this is different.
Below is my incomplete code:
import urllib2, re
from lxml import etree

def wgetUrl(target):
    # Fetch the page and return its raw HTML, or '' on any failure (Python 2).
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except Exception:
        return ''
    return outtxt

newUrl = 'http://www.iupui.edu/~webtrain/tutorials/tables.html' # homepage
dt = wgetUrl(newUrl)
parser = etree.HTMLParser()
tree = etree.fromstring(dt, parser)
The lxml documentation builds its element tree manually, but how can I use the HTML I have already read and parsed (the variable tree or dt in my example) to access a sub-element or, more importantly, the sub-element's text value?
Let's say that in the example webpage above I want to find the table "Supplies and Expenses" and generate an XPath expression dynamically from that value (Supplies and Expenses).
Is there any way to do this? My ultimate goal is to read a webpage and generate the XPath for a sub-element text value present in it.
To find all elements based on a part of their text value:
"//*[contains(text(), 'some_value')]"
For example, if you have this:
<div id="somediv">
    <span>Something is here</span>
    <a href="#">Click here</a>
</div>
You can find all sub-elements containing the word "here" like this:
"//div[@id='somediv']//*[contains(text(), 'here')]"
Or you can, for example, find all span elements under that div containing the word "Something":
"//div[@id='somediv']//span[contains(text(), 'Something')]"
As for parsing this in lxml:
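from lxml import etree
# response is the object returned by urllib2.urlopen() in the question
outtxt = response.read()
# use the HTML parser; the default XML parser rejects most real-world HTML
root = etree.fromstring(outtxt, etree.HTMLParser())
root.xpath("my_xpath_expression")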
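As a quick check, here is a minimal sketch that runs the two expressions above against the sample snippet. It assumes lxml is installed and uses Python 3 syntax; the variable names (snippet, here_tags, span_texts) are only for illustration.

```python
from lxml import etree

# The sample markup from above, inlined so the example is self-contained.
snippet = """
<div id="somediv">
    <span>Something is here</span>
    <a href="#">Click here</a>
</div>
"""

# HTMLParser wraps the fragment in <html><body>...</body></html>.
root = etree.fromstring(snippet, etree.HTMLParser())

# Every element under the div whose text contains "here".
here_elements = root.xpath("//div[@id='somediv']//*[contains(text(), 'here')]")
here_tags = [e.tag for e in here_elements]

# Only <span> elements under the div whose text contains "Something".
span_elements = root.xpath("//div[@id='somediv']//span[contains(text(), 'Something')]")
span_texts = [e.text for e in span_elements]

print(here_tags)
print(span_texts)
```

Note that contains(text(), ...) only looks at the first text node of each element, so text that follows a child element is not matched by these expressions.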
Update:
To get the full XPath expression for an element, use the ElementTree.getpath() method, like so:
tree = etree.ElementTree(root)
# this will print XPath of all
# elements in 'root'
for e in root.iter():
    print(tree.getpath(e))
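Putting the two pieces together, something like the following sketch could generate the XPath for every element whose text contains a given value. The helper name xpaths_for_text and the sample markup are made up for illustration (the markup is not the real page from the question), and the code assumes Python 3 with lxml installed.

```python
from lxml import etree

def xpaths_for_text(html_text, value):
    """Return the absolute XPath of every element whose text contains value."""
    root = etree.fromstring(html_text, etree.HTMLParser())
    tree = etree.ElementTree(root)
    # Pass the search value as an XPath variable ($value) so that quotes
    # inside it do not have to be escaped by hand.
    matches = root.xpath("//*[contains(text(), $value)]", value=value)
    return [tree.getpath(e) for e in matches]

# Hypothetical stand-in for the downloaded page.
sample = """
<html><body>
  <table>
    <tr><th>Supplies and Expenses</th></tr>
    <tr><td>Paper</td></tr>
  </table>
</body></html>
"""

paths = xpaths_for_text(sample, "Supplies and Expenses")
print(paths)
```

Each returned path can be passed back to root.xpath() to get the same element again. For a real page you would pass in the downloaded HTML (dt in the question) instead of sample.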