
Why does xpath inside a selector loop still return a list in the tutorial?

Tags: xpath, scrapy

I am learning scrapy with the tutorial: http://doc.scrapy.org/en/1.0/intro/tutorial.html

When I run the following example script from the tutorial, I find that even though it is already looping through the selector list, the title I get from sel.xpath('a/text()').extract() is still a list containing one string: [u'Python 3 Object Oriented Programming'] rather than u'Python 3 Object Oriented Programming'. In a later example the list is assigned to the item as item['title'] = sel.xpath('a/text()').extract(), which I think is not logically correct.

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc

However, if I use the following code:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            link = href.extract()
            print(link)

the link is a string rather than a list.

Is this a bug or intended?

asked Feb 26 '16 10:02 by entron



1 Answer

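.xpath().extract() and .css().extract() return a list because .xpath() and .css() always return SelectorList objects, even when they are called on a single Selector inside a loop (as with sel.xpath('a/text()') in your first example) and even when only one node matches.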

See https://parsel.readthedocs.org/en/v1.0.1/usage.html#parsel.selector.SelectorList.extract

(SelectorList) .extract():

Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.

.extract_first() is what you are looking for, although it is poorly documented.

Taken from http://doc.scrapy.org/en/latest/topics/selectors.html :

If you want to extract only first matched element, you can call the selector .extract_first()

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '

In your other example:

def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        link = href.extract()
        print(link)

each href in the loop will be a Selector object. Calling .extract() on it will get you a single Unicode string back:

$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/"
2016-02-26 12:11:36 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
In [1]: response.css("ul.directory.dir-col > li > a::attr('href')")
Out[1]: 
[<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
 <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
 ...
 <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>]

so .css() on the response returns a SelectorList:

In [2]: type(response.css("ul.directory.dir-col > li > a::attr('href')"))
Out[2]: scrapy.selector.unified.SelectorList

Looping on that object gives you Selector instances:

In [5]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
   ...:     print href
   ...:     
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
(...)
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>

And calling .extract() gives you a single Unicode string:

In [6]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
   ...:     print type(href.extract())
   ...:     
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>

Note: .extract() on Selector is wrongly documented as returning a list of strings. I'll open an issue on parsel (which is the same code as Scrapy's selectors, and is used under the hood in Scrapy 1.1+).
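
To make the difference concrete, here is a minimal sketch of the behaviour described above, written in plain Python 3 so that it runs without Scrapy installed. ToySelector and ToySelectorList are hypothetical stand-ins, not parsel's real classes; they only mimic the return types of .extract() and .extract_first() as explained in this answer.

```python
# Toy stand-ins for Selector and SelectorList (assumed names, NOT the real
# parsel/Scrapy classes). They only model what each method returns.


class ToySelector:
    """Wraps exactly one matched node (here simply a string)."""

    def __init__(self, data):
        self.data = data

    def extract(self):
        # One node in, one string out.
        return self.data


class ToySelectorList(list):
    """A list of ToySelector objects, like the result of .xpath() or .css()."""

    def extract(self):
        # Always a list, even when only one node matched.
        return [sel.extract() for sel in self]

    def extract_first(self, default=None):
        # First match as a plain string, or `default` when nothing matched.
        for sel in self:
            return sel.extract()
        return default


# Pretend this is what sel.xpath('a/text()') returned inside the loop.
matches = ToySelectorList([ToySelector("Python 3 Object Oriented Programming")])

as_list = matches.extract()                      # list with one string
as_string = matches.extract_first()              # the string itself
per_item = [sel.extract() for sel in matches]    # looping yields single selectors
missing = ToySelectorList().extract_first()      # nothing matched

print(as_list)
print(as_string)
print(per_item)
print(missing)
```

In the spider from the question, the equivalent change would be to call .extract_first() instead of .extract() on sel.xpath('a/text()'), so that title is a string (or None when the li has no link text).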

answered Sep 28 '22 10:09 by paul trmbrth