I have a question about XPath. My HTML is:
<div id="A" >
<div class="B">
<div class="C">
<div class="item">
<div class="area">
<div class="sec">USA</div>
<table>
<tbody>
<tr>
<td><a href="">D1</a></td>
<td>D2</td>
</tr>
<tr class="even">
<td><a href="">E1</a></td>
<td>E2</td>
</tr>
</tbody>
</table>
</div>
<div class="area">
<div class="sec">UK</div>
<table>
<tbody>
<tr>
<td><a href="">F1</a></td>
<td>F2</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
My code is:
from scrapy.selector import Selector

sel = Selector(response)
group = sel.xpath("//div[@id='A']/div[@class='B']/div[@class='C']/div[@class='item']/div[@class='area']/table/tbody/tr")
for g in group:
    # section = g.xpath("").extract()  # ancestor???
    context = g.xpath("./td[1]/a/text()").extract()
    brief = g.xpath("./td[2]/text()").extract()
    # print(section[0])
    print(context[0])
    print(brief[0])
It prints:
D1
D2
E1
E2
F1
F2
But I want it to print:
USA
D1
D2
USA
E1
E2
UK
F1
F2
So for each row I need to select a value from an ancestor node, so that I can get USA and UK.
I have not been able to figure this out.
Could you please show me how? Thank you!
When you scrape web pages, you extract specific parts of the HTML source with selectors, which are written as either XPath or CSS expressions. Selectors are built on the lxml library, which processes XML and HTML in Python.
The difference between the parent:: and ancestor:: axes is conveyed by their names: a parent is the immediate (direct) ancestor. So yes, /a/b/c/d/ancestor::*[1] selects the same node as /a/b/c/d/parent::*.
When you need the text content of a node inside an XPath string function, use . (dot) instead of .//text(). The expression .//text() yields a collection of text nodes called a node-set, and when a node-set is converted to a string, only the text of its first node is used.
In XPath, you can traverse back up the tree with .. (the parent step), so a selector like this could work for you:
section = g.xpath('../../../div[@class="sec"]/text()').extract()
Although this works, it depends heavily on the exact document structure: each .. climbs one level, from tr to tbody to table to the enclosing div. If you need a bit more flexibility, say, to tolerate minor structural changes to the document, you could search upwards for an ancestor like this:
section = g.xpath('ancestor::div[@class="area"]/div[@class="sec"]/text()').extract()