Is there a way to stop crawling when a specific condition becomes true (e.g. scrap_item_id == predefine_value)? My problem is similar to Scrapy - how to identify already scraped urls, but I want to 'force' my Scrapy spider to stop crawling once it discovers the last item it already scraped.
In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider. Note that this does force the spider to stop, but not immediately: some requests that are already scheduled may still run.
The 0.14 release notes mention: "Added CloseSpider exception to manually close spiders (r2691)"
Example as per the docs:
def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')
See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
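Applied to the question's scenario, a minimal sketch on a modern Scrapy might look like this (PREDEFINED_LAST_ID, the example URL, and the CSS selector are illustrative placeholders, not from the original question):

import scrapy
from scrapy.exceptions import CloseSpider

PREDEFINED_LAST_ID = '12345'   # hypothetical: id of the last item already scraped


class ItemsSpider(scrapy.Spider):
    name = 'items'
    start_urls = ['http://example.com/items']

    def parse(self, response):
        for item_id in response.css('div.item::attr(data-id)').extract():
            if item_id == PREDEFINED_LAST_ID:
                # Raising CloseSpider stops scheduling new requests; requests
                # that are already in flight may still finish first.
                raise CloseSpider('reached_last_scraped_item')
            yield {'id': item_id}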
This question was asked 8 months ago, but I was wondering the same thing and found another (not great) solution. Hopefully this can help future readers.
I'm connecting to a database in my pipeline file, and if the database connection fails, I want the spider to stop crawling (there's no point in collecting data if there's nowhere to send it). What I ended up doing was using:
from scrapy.project import crawler

crawler._signal_shutdown(9, 0)  # run this if the connection fails
This causes the Spider to do the following:
[scrapy] INFO: Received SIGKILL, shutting down gracefully. Send again to force unclean shutdown.
I just pieced this together after reading your comment and looking through the "/usr/local/lib/python2.7/dist-packages/Scrapy-0.12.0.2543-py2.7.egg/scrapy/crawler.py" file. I'm not totally sure what it's doing; the first number passed to the function is the signal name (for example, using 3,0 instead of 9,0 returns the error [scrapy] INFO: Received SIGKILL...).
Seems to work well enough though. Happy scraping.
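For context, a rough sketch of how that check might sit in a pipeline on Scrapy ~0.12 (the scrapy.project singleton was removed in later releases; MySQLdb and the connection settings are illustrative, since the answer doesn't name a database):

import MySQLdb  # illustrative driver choice

from scrapy.project import crawler  # old Scrapy (~0.12) singleton, removed later


class DatabasePipeline(object):

    def __init__(self):
        try:
            self.conn = MySQLdb.connect(host='localhost', user='scrapy',
                                        passwd='secret', db='items')
        except MySQLdb.OperationalError:
            # No database to write items to, so ask the engine to shut down.
            crawler._signal_shutdown(9, 0)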
EDIT: I also suppose that you could just force your program to shut down with something like:
import sys

sys.exit("SHUT DOWN EVERYTHING!")
From a pipeline, I prefer the following solution.
class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # close_spider() expects the spider instance, not the pipeline (self)
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item
Source: Force spider to stop in scrapy
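If you also want to keep the triggering duplicate out of your output, a small variation might look like this (the ids_seen set and the item['id'] field are illustrative and not from the original answer; DropItem is a standard Scrapy exception):

from scrapy.exceptions import DropItem


class MongoDBPipeline(object):

    def __init__(self):
        self.ids_seen = set()  # illustrative in-memory duplicate check

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            # Ask the engine to close the spider, then discard the repeated item.
            spider.crawler.engine.close_spider(spider, reason='duplicate')
            raise DropItem('Duplicate item found: %s' % item['id'])
        self.ids_seen.add(item['id'])
        return item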