
How to get immediate parent node with scrapy in python?

I am new to Scrapy and want to crawl some data from the web. The HTML documents I get look like the two variants below.

dom style1:
<div class="user-info">
    <p class="user-name">
        something in p tag
    </p>
    text data I want
</div>

dom style2:
<div class="user-info">
    <div>
        <p class="user-img">
            something in p tag
        </p>
        something in div tag
    </div>
    <div>
        <p class="user-name">
            something in p tag
        </p>
        text data I want
    </div>
</div>

I want to get the text "text data I want". Right now I can get it with a CSS or XPath selector by checking which structure exists, but I would like a better way. The data I want is always the text() of the immediate parent div of p.user-name, so I could select p.user-name with CSS first, then get its parent, and then take that parent's text(). The question is: how can I get the immediate parent of p.user-name?

Simon asked Dec 03 '22 13:12



1 Answer

With XPath you can traverse the XML tree in every direction (parent, sibling, child, etc.), whereas CSS selectors cannot select a parent.
In your case you can get a node's parent with the XPath .. (parent) notation:

//p[@class='user-name']/../text()

Explanation:
//p[@class='user-name'] - find <p> nodes with class value user-name.
/.. - select the node's parent (shorthand for parent::node()).
/text() - select the text nodes that are direct children of the current node (here, the parent).

This XPath should work in both of the cases you described.

Granitosaurus answered Dec 25 '22 10:12
