I have HTML like this:
<div id="all-stories" class="book">
<ul>
<li title="Book1" ><a href="book1_url">Book1</a></li>
<li title="Book2" ><a href="book2_url">Book2</a></li>
</ul>
</div>
I want to get the books and their respective URLs using XPath, but my approach is not working. For simplicity, I tried to extract all the elements under the "li" tags as follows:
lis = tree.xpath('//div[@id="all-stories"]/div/text()')
import lxml.html as LH
content = '''\
<div id="all-stories" class="book">
<ul>
<li title="Book1" ><a href="book1_url">Book1</a></li>
<li title="Book2" ><a href="book2_url">Book2</a></li>
</ul>
</div>
'''
root = LH.fromstring(content)
for atag in root.xpath('//div[@id="all-stories"]//li/a'):
    print(atag.attrib['href'], atag.text_content())
yields (on Python 3)
book1_url Book1
book2_url Book2
The XPath //div[@id="all-stories"]/div does not match anything because there is no child div inside the outer div tag.
The XPath //div[@id="all-stories"]/li would not match either, because there is no direct child li tag inside the div tag. However, //div[@id="all-stories"]//li does match the li tags, because // tells XPath to search recursively, as deeply as necessary, to find the li tags.
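To make the difference between / (direct child) and // (any descendant) concrete, here is a small sketch that counts the matches for each expression against the same HTML. The variable names and the extra /ul/li path are only illustrative additions, not part of the original answer.

```python
import lxml.html as LH

content = '''\
<div id="all-stories" class="book">
<ul>
<li title="Book1" ><a href="book1_url">Book1</a></li>
<li title="Book2" ><a href="book2_url">Book2</a></li>
</ul>
</div>
'''
root = LH.fromstring(content)

# "/" only looks at direct children of the outer div.
child_div = root.xpath('//div[@id="all-stories"]/div')
child_li = root.xpath('//div[@id="all-stories"]/li')

# "//" looks at descendants at any depth.
descendant_li = root.xpath('//div[@id="all-stories"]//li')

# Spelling out the intermediate ul also works with plain "/" steps.
via_ul = root.xpath('//div[@id="all-stories"]/ul/li')

print(len(child_div), len(child_li), len(descendant_li), len(via_ul))
# prints: 0 0 2 2
```

Naming the ul step explicitly is stricter than //, so it can be preferable when the page contains other li tags nested deeper that you do not want to pick up.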
Now, the content you are looking for is not in the li tag. It is inside the a tag. So instead use the XPath '//div[@id="all-stories"]//li/a' to reach the a tags.
The value of the href attribute can be accessed with atag.attrib['href'], and the text with atag.text_content().
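If you want the result as a mapping from book name to URL rather than printed pairs, something along these lines should work. This is a sketch under the same assumptions as the code above; the names books, hrefs and names are made up for the example. It also shows that XPath can return the attribute values and text nodes directly via @href and text().

```python
import lxml.html as LH

content = '''\
<div id="all-stories" class="book">
<ul>
<li title="Book1" ><a href="book1_url">Book1</a></li>
<li title="Book2" ><a href="book2_url">Book2</a></li>
</ul>
</div>
'''
root = LH.fromstring(content)

# Build {book name: url} from the a tags.
# atag.get('href') returns None instead of raising KeyError
# if an a tag happens to have no href attribute.
books = {
    atag.text_content(): atag.get('href')
    for atag in root.xpath('//div[@id="all-stories"]//li/a')
}
print(books)

# Alternatively, let XPath select the strings themselves.
hrefs = root.xpath('//div[@id="all-stories"]//li/a/@href')
names = root.xpath('//div[@id="all-stories"]//li/a/text()')
print(list(zip(names, hrefs)))
```

Note that zipping the two lists assumes every a tag has both an href and a text node; iterating over the a tags, as in the dictionary version, keeps each name paired with its own URL.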