 

Force my scrapy spider to stop crawling

Tags:

python

scrapy

Is there a way to stop crawling when a specific condition is true (like scrap_item_id == predefine_value)? My problem is similar to Scrapy - how to identify already scraped urls, but I want to 'force' my Scrapy spider to stop crawling once it discovers the last previously scraped item.

no1 asked Dec 15 '10



3 Answers

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider.

The 0.14 release notes mention: "Added CloseSpider exception to manually close spiders (r2691)".

Example as per the docs:

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
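Applied to the condition from the question, a minimal sketch might look like the following (assuming a recent Scrapy release; the start URL, the data-id attribute, and predefine_value are placeholders for illustration, not a real API):

from scrapy import Spider
from scrapy.exceptions import CloseSpider

class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['http://example.com/items']  # placeholder

    predefine_value = 12345  # id of the last item already scraped (placeholder)

    def parse(self, response):
        for row in response.css('.item'):
            scrap_item_id = int(row.css('::attr(data-id)').get())
            if scrap_item_id == self.predefine_value:
                # Raising CloseSpider stops scheduling new requests;
                # requests already in flight are still processed.
                raise CloseSpider('reached_last_scraped_item')
            yield {'id': scrap_item_id}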

Sjaak Trekhaak answered Oct 09 '22


This question was asked 8 months ago, but I was wondering the same thing and found another (not great) solution. Hopefully this can help future readers.

I'm connecting to a database in my pipeline file; if the database connection is unsuccessful, I want the spider to stop crawling (there's no point in collecting data if there's nowhere to send it). What I ended up doing was using:

from scrapy.project import crawler

crawler._signal_shutdown(9, 0)  # Run this if the cnxn fails.

This causes the Spider to do the following:

[scrapy] INFO: Received SIGKILL, shutting down gracefully. Send again to force unclean shutdown. 

I just kind of pieced this together after reading your comment and looking through the "/usr/local/lib/python2.7/dist-packages/Scrapy-0.12.0.2543-py2.7.egg/scrapy/crawler.py" file. I'm not totally sure what it's doing; the first number passed to the function is the signame (for example, using 3,0 instead of 9,0 returns error [scrapy] INFO: Received SIGKILL...).

Seems to work well enough though. Happy scraping.
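For context, here is a rough sketch of how that could sit in a pipeline (the MySQLdb connection is only an illustration of "the cnxn"; the _signal_shutdown call is the same private, 0.12-era API described above, so treat this as a sketch rather than a supported approach):

import MySQLdb  # placeholder DB driver

from scrapy.project import crawler  # legacy singleton, old Scrapy (~0.12) only

class DatabasePipeline(object):
    def __init__(self):
        try:
            self.cnxn = MySQLdb.connect(host='localhost', user='scrapy',
                                        passwd='secret', db='items')
        except MySQLdb.Error:
            # No point crawling if there's nowhere to store the data:
            # ask the running crawler to shut down as if it got SIGKILL (9).
            crawler._signal_shutdown(9, 0)

    def process_item(self, item, spider):
        # ... insert the item into the database here ...
        return item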

EDIT: I also suppose that you could just force your program to shut down with something like:

import sys

sys.exit("SHUT DOWN EVERYTHING!")
alukach answered Oct 09 '22


From a pipeline, I prefer the following solution.

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # Pass the spider (not the pipeline) to close_spider
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item

Source: Force spider to stop in scrapy
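Tying this back to the question, a hedged sketch of a pipeline that stops the crawl once it sees an item id it has already processed (the item_id field name and the in-memory set are illustrative assumptions):

class StopOnDuplicatePipeline(object):
    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        item_id = item['item_id']  # assumed field name
        if item_id in self.seen_ids:
            # Tell the engine to close the spider; requests already
            # in flight still finish, but no new ones are scheduled.
            spider.crawler.engine.close_spider(spider, reason='duplicate')
        else:
            self.seen_ids.add(item_id)
        return item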

Macbric answered Oct 09 '22