
Scrapy XPath: choose the ancestor node

I have a question about XPath. Given this HTML:

    <div id="A" >
    <div class="B">
        <div class="C">
            <div class="item">
                <div class="area">
                    <div class="sec">USA</div>
                    <table>
                        <tbody> 
                            <tr>
                                <td><a href="">D1</a></td>
                                <td>D2</td>
                            </tr>
                            <tr class="even">
                                <td><a href="">E1</a></td>
                                <td>E2</td>
                            </tr>
                        </tbody>
                    </table>
                </div>
                <div class="area">
                    <div class="sec">UK</div>
                    <table>
                        <tbody> 
                            <tr>
                                <td><a href="">F1</a></td>
                                <td>F2</td>
                            </tr>
                        </tbody>
                    </table>
                </div>
            </div>
        </div>
    </div>
    </div>

My code is:

from scrapy.selector import Selector

sel = Selector(response)
group = sel.xpath("//div[@id='A']/div[@class='B']/div[@class='C']/div[@class='item']/div[@class='area']/table/tbody/tr")
for g in group:
    # section = g.xpath("").extract()  # ancestor???
    context = g.xpath("./td[1]/a/text()").extract()
    brief = g.xpath("./td[2]/text()").extract()
    # print(section[0])
    print(context[0])
    print(brief[0])

It prints:

D1
D2
E1
E2
F1
F2

But I want it to print:

USA
D1
D2
USA
E1
E2
UK
F1
F2

So for each row I need to reach up to an ancestor node to get its section name (USA or UK).
I haven't been able to figure out how to do this.
Any help would be appreciated, thank you!

user2492364 asked Oct 23 '14 08:10



1 Answer

In XPath, you can step up the tree to a node's parent with .. , so a selector like this could work for you:

section = g.xpath('../../../div[@class="sec"]/text()').extract()

Although this works, it depends heavily on the exact document structure: each .. climbs exactly one level (tr to tbody to table to div.area). If you need more flexibility, for example to tolerate minor structural changes to the document, you can search upwards for an ancestor instead:

section = g.xpath('ancestor::div[@class="area"]/div[@class="sec"]/text()').extract()
andrean answered Sep 20 '22 21:09