
get list items inside div tag using xpath

Tags:

python

xpath

lxml

I have HTML like this:

<div id="all-stories" class="book"> 
<ul>

<li title="Book1"  ><a href="book1_url">Book1</a></li>

<li title="Book2"  ><a href="book2_url">Book2</a></li>
</ul>

</div>

I want to get the books and their respective URLs using XPath, but my approach is not working. For simplicity, I tried to extract all the elements under the li tags as follows:

lis = tree.xpath('//div[@id="all-stories"]/div/text()')
Anurag Sharma asked Jun 29 '13 13:06



1 Answer

import lxml.html as LH

content = '''\
<div id="all-stories" class="book"> 
<ul>

<li title="Book1"  ><a href="book1_url">Book1</a></li>

<li title="Book2"  ><a href="book2_url">Book2</a></li>
</ul>

</div>
'''
root = LH.fromstring(content)
for atag in root.xpath('//div[@id="all-stories"]//li/a'):
    print(atag.attrib['href'], atag.text_content())

yields (on Python 3; Python 2 prints each pair as a tuple instead)

book1_url Book1
book2_url Book2

The XPath //div[@id="all-stories"]/div does not match anything because there is no child div inside the outer div tag.

The XPath //div[@id="all-stories"]/li would not match either, because there is no li tag that is a direct child of the div tag; the li tags sit inside the ul. However, //div[@id="all-stories"]//li does match the li tags, because // tells XPath to search all descendants, as deeply as necessary.
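To make the difference between / and // concrete, here is a small sketch that runs each of the expressions discussed above against the same HTML and counts the matches. The variable names are only illustrative, and the last expression (spelling out the path through ul) is an extra variant not mentioned in the question.

```python
import lxml.html as LH

content = '''\
<div id="all-stories" class="book">
<ul>
<li title="Book1"><a href="book1_url">Book1</a></li>
<li title="Book2"><a href="book2_url">Book2</a></li>
</ul>
</div>
'''
root = LH.fromstring(content)

# No div is a direct child of the outer div, so nothing matches.
child_divs = root.xpath('//div[@id="all-stories"]/div')
# The li tags are grandchildren (via ul), so a single / misses them.
direct_lis = root.xpath('//div[@id="all-stories"]/li')
# // searches all descendants, so both li tags are found.
descendant_lis = root.xpath('//div[@id="all-stories"]//li')
# Spelling out the full path through ul also finds both.
explicit_lis = root.xpath('//div[@id="all-stories"]/ul/li')

print(len(child_divs), len(direct_lis), len(descendant_lis), len(explicit_lis))
```

The first two counts should be 0 and the last two should be 2, which matches the explanation above.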

Now, the content you are looking for is not in the li tag. It is inside the a tag. So instead use the XPath '//div[@id="all-stories"]//li/a' to reach the a tags. The value of the href attribute can be accessed with atag.attrib['href'], and the text with atag.text_content().
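If you want the results in a data structure rather than printed, one possible sketch is to collect them into a dict keyed by the link text. The books and hrefs names are hypothetical, chosen for illustration. atag.get('href') is used here instead of atag.attrib['href'] so that a link without an href yields None rather than raising KeyError.

```python
import lxml.html as LH

content = '''\
<div id="all-stories" class="book">
<ul>
<li title="Book1"><a href="book1_url">Book1</a></li>
<li title="Book2"><a href="book2_url">Book2</a></li>
</ul>
</div>
'''
root = LH.fromstring(content)

# Map each link's text to its URL.
books = {}
for atag in root.xpath('//div[@id="all-stories"]//li/a'):
    books[atag.text_content()] = atag.get('href')

# XPath can also select the attribute values directly with @href.
hrefs = root.xpath('//div[@id="all-stories"]//li/a/@href')

print(books)
print(hrefs)
```

Selecting @href in the XPath itself is handy when only the URLs are needed. Iterating over the a elements is the better fit when you need the text and the attribute together.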

unutbu answered Sep 20 '22 14:09
