 

Raise close spider from Scrapy pipeline

Tags:

python

scrapy

I need to raise CloseSpider from a Scrapy Pipeline. Either that or return some parameter from the Pipeline back to the Spider to do the raise.

For example, if the date already exists, raise CloseSpider:

raise CloseSpider('Already been scraped:' + response.url)

Is there a way to do this?

asked May 20 '18 by MoreScratch


People also ask

How do you close a Scrapy spider?

To force a spider to close, you can raise the CloseSpider exception, as described in the Scrapy docs. Just be sure to return/yield your items before you raise the exception.
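
A minimal sketch of what that looks like inside a spider callback (the selectors and the stop condition are made up for illustration):

import scrapy
from scrapy.exceptions import CloseSpider

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Yield items first so they still reach the item pipelines...
        for title in response.css('h2::text').getall():
            yield {'title': title}
        # ...then stop the whole spider once the stop condition is met.
        if response.css('.already-seen'):
            raise CloseSpider('Nothing new on this page')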

How do you run multiple spiders in a Scrapy?

Use the CrawlerProcess class to run multiple Scrapy spiders in one process simultaneously. Create an instance of CrawlerProcess with the project settings; if a spider needs its own custom settings, create a Crawler instance for that spider.
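
A rough sketch under those assumptions (SpiderOne and SpiderTwo are placeholder spider classes from your project):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # project settings apply to all spiders
process.crawl(SpiderOne)   # placeholder spider class
process.crawl(SpiderTwo)   # placeholder spider class
process.start()            # blocks until every spider has finished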

How does a Scrapy pipeline work?

Scrapy is a web scraping library used to scrape, parse and collect web data. Scraped data is handled in the pipelines.py file through various components (classes) that are executed sequentially.
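
For illustration, a bare-bones component in pipelines.py might look like this (the 'title' field and the drop condition are invented for the example):

from scrapy.exceptions import DropItem

class ExamplePipeline:
    def process_item(self, item, spider):
        # Called once for every item the spider yields.
        if not item.get('title'):
            raise DropItem('Missing title')   # discard incomplete items
        item['title'] = item['title'].strip()
        return item                           # hand the item to the next pipeline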

How do you activate the pipeline in Scrapy?

You can activate an Item Pipeline component by adding its class to the ITEM_PIPELINES setting, as shown below. The integer value assigned to each class determines the order in which they run: items pass from lower-valued to higher-valued classes, and values are conventionally in the 0-1000 range.
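
For example, in settings.py (the class path is a placeholder for your own pipeline):

ITEM_PIPELINES = {
    'myproject.pipelines.ExamplePipeline': 300,  # lower numbers run earlier
}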


2 Answers

According to the Scrapy docs, the CloseSpider exception can only be raised from a spider callback (by default, the parse function). Raising it in a pipeline will crash the spider. To achieve a similar result from a pipeline, you can initiate a shutdown signal, which will close Scrapy gracefully.

from scrapy.project import crawler  # note: scrapy.project is a legacy module removed in newer Scrapy versions
crawler._signal_shutdown(9, 0)      # ask the running crawler to shut down gracefully

Do remember that Scrapy might still process requests that were already fired or scheduled, even after the shutdown signal has been initiated.

To do it from the spider instead, set a flag on the spider from the pipeline, like this:

def process_item(self, item, spider):
    if some_condition_is_met:
        spider.close_manually = True
    return item  # pipelines should always return the item (or raise DropItem)

After this, in the callback function of your spider, you can raise the CloseSpider exception:

from scrapy.exceptions import CloseSpider

def parse(self, response):
    # getattr avoids an AttributeError if the pipeline never set the flag
    if getattr(self, 'close_manually', False):
        raise CloseSpider('Already been scraped.')
answered Oct 22 '22 by Ahsan Roy


I prefer the following solution.

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # pass the spider (not the pipeline) to close_spider
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item

Source: Force spider to stop in scrapy

answered Oct 22 '22 by Macbric