I have this HTML snippet:
<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
Now I want to parse it with lxml.html. In the end I want a function that takes a search term (e.g. "one") and returns:
One
#link1
For now I'm trying to use a variable in the XPath expression.
Works:
import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")
print test
Trying with a variable: I want to replace the hardcoded 'One'
with a variable that I can pass to the function later.
Doesn't work:
import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")
desiredvars = ['One']
myresultset=((var, html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='%s']"%(var))[0]) for var in desiredvars)
for each in myresultset:
print each
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <genexpr>
IndexError: list index out of range
This is based on this answer: https://stackoverflow.com/a/10688235/2320453 Any idea why it doesn't work? Is this the "right way" to do something like this?
EDIT: To sum things up: I want to search within the a tags and get the text and the href attribute of the matching link. I don't want a complete list of links; I want to be able to search with a variable. Pseudo-code:
import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")
searchterm = 'one'
test=html.xpath("...a/text()=searchterm")
print test
Expected result
One
#link1
Your first example works, but probably not the way you think it does:
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")
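What this returns is a boolean, which is true if the condition ...='One' holds for at least one node in the node-set on the left-hand side of the XPath expression. Your second example uses the predicate form a[text()='...'] instead, which returns a list of elements. The IndexError there means that list was empty: the expression matched nothing in the document that was actually parsed, so [0] had nothing to index.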
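To make the difference in return types concrete, here is a minimal sketch. It assumes the snippet from the question with the missing closing tags filled in and parsed from a string rather than a URL; SAMPLE, root and base are names made up for the illustration.

```python
import lxml.html

# Assumption: the question's snippet, shortened and with closing tags added.
SAMPLE = """
<div id="dw__toc">
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
</ul>
</li>
</ul>
</div>
"""

root = lxml.html.fromstring(SAMPLE)
base = "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a"

# Comparison at the end of the path: the whole expression is a boolean.
as_bool = root.xpath(base + "/text()='One'")

# Comparison inside a predicate: the result is a list of matching elements.
as_nodes = root.xpath(base + "[text()='One']")

# Same predicate form, but nothing matches: the list is empty,
# so indexing it with [0] would raise IndexError.
no_match = root.xpath(base + "[text()='Four']")

print(type(as_bool), len(as_nodes), len(no_match))
```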
You probably want all nodes matching the expression, i.e. all a elements that have 'One' as text. The corresponding expression would be:
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']")
This returns a node-set (in lxml, a list of elements). If you just need the URL as a string, select the href attribute instead:
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']/@href")
# returns: ['#link1']
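For the function you describe, one possible sketch is below. Instead of formatting the search term into the expression with %s, it passes the term as an XPath variable ($term), which lxml accepts as a keyword argument to xpath(). The function name find_link, the SAMPLE markup (your snippet with closing tags added) and the case-insensitive comparison via translate() are my assumptions about what you want, not something lxml prescribes.

```python
import lxml.html

# Assumption: the question's snippet, shortened and with closing tags added.
SAMPLE = """
<div id="dw__toc">
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
</ul>
</li>
</ul>
</div>
"""

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def find_link(root, searchterm):
    """Return (text, href) of the first level-2 TOC link whose text equals
    searchterm, ignoring ASCII case and surrounding whitespace.
    Returns None when nothing matches (hypothetical helper)."""
    matches = root.xpath(
        "//ul[@class='toc']/li[@class='level2']/div[@class='li']"
        # XPath 1.0 has no lower-case(); translate() maps A-Z to a-z.
        "/a[translate(normalize-space(.), $upper, $lower) = $term]",
        upper=UPPER,
        lower=LOWER,
        term=searchterm.strip().lower(),
    )
    if not matches:
        return None
    link = matches[0]
    return link.text_content().strip(), link.get("href")


root = lxml.html.fromstring(SAMPLE)
print(find_link(root, "one"))
print(find_link(root, "four"))
```

Checking for an empty result before indexing avoids the IndexError from your second example, and the same function should work on the tree returned by lxml.html.parse, since its xpath() method takes variables the same way.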