I am new to Scrapy, and I apologize if this question is trivial. I have read the documentation on the official website, and while going through it I came across this example:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
I know that the parse method must return an item and/or a request, but where are these return values returned to?
One is an item and the other is a request, and I assume these two types are handled differently. And in the case of CrawlSpider, there is a Rule with a callback. Where does that callback's return value go? The same place as parse()'s?
I am very confused about Scrapy's procedure, even after reading the documentation.
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
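As a minimal sketch of that flow (the URLs, spider name, and parse_page callback below are illustrative, not from the question):

import scrapy

class FlowSpider(scrapy.Spider):
    name = "flow-demo"  # hypothetical spider name
    start_urls = ["http://www.example.com/1.html"]

    def parse(self, response):
        # 'response' is the Response the Downloader produced for one of the
        # start_urls Requests; the engine calls parse() with it.
        for url in response.xpath("//a/@href").getall():
            # Yielding a Request does not "return" it to any caller: the
            # engine schedules it, the Downloader fetches it, and the
            # resulting Response is passed to self.parse_page.
            yield response.follow(url, callback=self.parse_page)

    def parse_page(self, response):
        # Items yielded from this callback travel back to the engine too.
        yield {"url": response.url}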
You can use regular techniques such as printing, logging, or ordinary file handling to save the data a Scrapy spider returns. However, Scrapy offers a built-in way of saving and storing data through the yield keyword.
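For example (a minimal sketch; the spider name and output file are placeholders), yielding plain dicts is enough for Scrapy's feed exports to persist them:

import scrapy

class TitlesSpider(scrapy.Spider):
    name = "titles"  # hypothetical
    start_urls = ["http://www.example.com/1.html"]

    def parse(self, response):
        # Each yielded dict is collected by the engine and handed to the
        # item pipelines / feed exporters -- no manual file handling needed.
        for h3 in response.xpath("//h3/text()").getall():
            yield {"title": h3}

Running scrapy crawl titles -o titles.json would then write every yielded item to titles.json via the built-in feed exporter.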
start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it.
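Here is a minimal CrawlSpider sketch (the link pattern and parse_item callback name are assumptions for illustration). Note that a Rule callback's yielded values are handled exactly like the return values of parse() in a plain Spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RecursiveSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # Follow every extracted link; for each downloaded page, Scrapy
        # calls parse_item and routes whatever it yields back to the
        # engine, just like Spider.parse(). follow=True keeps the crawl
        # recursive (it defaults to False when a callback is set).
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.xpath("//h3/text()").get()}

A CrawlSpider must not override parse() itself, since CrawlSpider uses that method internally to apply the rules.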
According to the documentation:
The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).
In other words, returned/yielded items and requests are handled differently: items are handed to the item pipelines and item exporters, while requests are put into the Scheduler, which pipes them to the Downloader to perform the actual HTTP request and return a response. The engine then receives the response and gives it to the spider for processing (via the callback method).
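To make the pipeline side concrete, here is a minimal sketch of an item pipeline (the class name and project path are hypothetical; it only runs once enabled through the ITEM_PIPELINES setting):

# pipelines.py
class TitleCleanupPipeline:
    def process_item(self, item, spider):
        # Scrapy calls this once for every item a spider callback yields;
        # whatever is returned here moves on to the next pipeline stage.
        item["title"] = item.get("title", "").strip()
        return item

# settings.py (hypothetical project path)
# ITEM_PIPELINES = {"myproject.pipelines.TitleCleanupPipeline": 300}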
The whole data-flow process is described in the Architecture Overview page in a very detailed manner.
Hope that helps.