I am learning scrapy with the tutorial: http://doc.scrapy.org/en/1.0/intro/tutorial.html
When I ran the following example script from the tutorial, I found that even though it was already looping through the selector list, the title I got from sel.xpath('a/text()').extract() was still a list containing one string, like [u'Python 3 Object Oriented Programming'] rather than u'Python 3 Object Oriented Programming'. In a later example the list is assigned to the item as item['title'] = sel.xpath('a/text()').extract(), which I think is not logically correct.
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc
However if I use the following code:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            link = href.extract()
            print(link)
the link is a string rather than a list.
Is this a bug or intended?
When you use text nodes in an XPath string function, use . (dot) instead of .//text(). The expression .//text() produces a collection of text nodes (a node-set), and when a node-set is converted to a string only the text of its first node is used.
When scraping web pages, you extract specific parts of the HTML source using selectors, which are written as either XPath or CSS expressions. Selectors are built on the lxml library, which processes XML and HTML in Python.
Scrapy comes with its own mechanism for extracting data. They're called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.
.xpath().extract() and .css().extract() return a list because .xpath() and .css() return SelectorList objects.
See https://parsel.readthedocs.org/en/v1.0.1/usage.html#parsel.selector.SelectorList.extract
(SelectorList) .extract():
Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.
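To make that relationship concrete, here is a deliberately simplified model in plain Python. MiniSelector and MiniSelectorList are hypothetical names invented for this sketch; this is not parsel's actual implementation, only the shape of the behaviour described above:

```python
class MiniSelector:
    """Hypothetical stand-in for a single matched node."""

    def __init__(self, data):
        self.data = data

    def extract(self):
        # One node in, one string out.
        return self.data


class MiniSelectorList(list):
    """Hypothetical stand-in for the list of matches a query returns."""

    def extract(self):
        # Call extract() on every element and collect the results,
        # so the result is always a list, even for a single match.
        return [sel.extract() for sel in self]


matches = MiniSelectorList([MiniSelector("Python 3 Object Oriented Programming")])

from_list = matches.extract()        # a list with one string
from_element = matches[0].extract()  # just the string

print(from_list)
print(from_element)
```

A query can always match zero, one or many nodes, so the list-level extract() has no way to know that you expected exactly one result.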
.extract_first() is what you are looking for (it is poorly documented).
Taken from http://doc.scrapy.org/en/latest/topics/selectors.html :
If you want to extract only the first matched element, you can call the selector's .extract_first() method:
>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '
In your other example:
def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        link = href.extract()
        print(link)
each href in the loop will be a Selector object. Calling .extract() on it will get you a single Unicode string back:
$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/"
2016-02-26 12:11:36 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
In [1]: response.css("ul.directory.dir-col > li > a::attr('href')")
Out[1]:
[<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
...
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>]
So .css() on the response returns a SelectorList:
In [2]: type(response.css("ul.directory.dir-col > li > a::attr('href')"))
Out[2]: scrapy.selector.unified.SelectorList
Looping on that object gives you Selector instances:
In [5]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
   ...:     print href
   ...:
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
(...)
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
And calling .extract() gives you a single Unicode string:
In [6]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
   ...:     print type(href.extract())
   ...:
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
Note: .extract() on Selector is wrongly documented as returning a list of strings. I'll open an issue on parsel (which is the same as Scrapy selectors, and is used under the hood in Scrapy 1.1+).