 

Force my scrapy spider to stop crawling

Tags:

python

scrapy

Is there a way to stop crawling when a specific condition is true (like scrap_item_id == predefine_value)? My problem is similar to Scrapy - how to identify already scraped urls, but I want to 'force' my Scrapy spider to stop crawling once it discovers the last previously scraped item.

no1 asked Dec 15 '10



3 Answers

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider.

The 0.14 release notes mention: "Added CloseSpider exception to manually close spiders (r2691)".

Example as per the docs:

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
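Applied to the condition from the question, a minimal sketch might look like the following (assuming a recent Scrapy release; the start URL, the data-id attribute, and predefine_value are placeholders for illustration, not a real API):

from scrapy import Spider
from scrapy.exceptions import CloseSpider

class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['http://example.com/items']  # placeholder

    predefine_value = 12345  # id of the last item already scraped (placeholder)

    def parse(self, response):
        for row in response.css('.item'):
            scrap_item_id = int(row.css('::attr(data-id)').get())
            if scrap_item_id == self.predefine_value:
                # Raising CloseSpider stops scheduling new requests;
                # requests already in flight are still processed.
                raise CloseSpider('reached_last_scraped_item')
            yield {'id': scrap_item_id}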

Sjaak Trekhaak answered Oct 09 '22


This question was asked 8 months ago, but I was wondering the same thing and found another (not great) solution. Hopefully this can help future readers.

I'm connecting to a database in my pipeline file; if the database connection is unsuccessful, I want the spider to stop crawling (there's no point in collecting data if there's nowhere to send it). What I ended up doing was using:

from scrapy.project import crawler

crawler._signal_shutdown(9, 0)  # Run this if the cnxn fails.

This causes the Spider to do the following:

[scrapy] INFO: Received SIGKILL, shutting down gracefully. Send again to force unclean shutdown. 

I just kind of pieced this together after reading your comment and looking through the "/usr/local/lib/python2.7/dist-packages/Scrapy-0.12.0.2543-py2.7.egg/scrapy/crawler.py" file. I'm not totally sure what it's doing; the first number passed to the function is the signame (for example, using 3,0 instead of 9,0 returns error [scrapy] INFO: Received SIGKILL...).

Seems to work well enough though. Happy scraping.
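For context, here is a rough sketch of how that could sit in a pipeline (the MySQLdb connection is only an illustration of "the cnxn"; the _signal_shutdown call is the same private, 0.12-era API described above, so treat this as a sketch rather than a supported approach):

import MySQLdb  # placeholder DB driver

from scrapy.project import crawler  # legacy singleton, old Scrapy (~0.12) only

class DatabasePipeline(object):
    def __init__(self):
        try:
            self.cnxn = MySQLdb.connect(host='localhost', user='scrapy',
                                        passwd='secret', db='items')
        except MySQLdb.Error:
            # No point crawling if there's nowhere to store the data:
            # ask the running crawler to shut down as if it got SIGKILL (9).
            crawler._signal_shutdown(9, 0)

    def process_item(self, item, spider):
        # ... insert the item into the database here ...
        return item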

EDIT: I also suppose that you could just force your program to shut down with something like:

import sys

sys.exit("SHUT DOWN EVERYTHING!")
alukach answered Oct 09 '22


From a pipeline, I prefer the following solution.

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # Pass the spider (not the pipeline) to close_spider
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item

Source: Force spider to stop in scrapy
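Tying this back to the question, a hedged sketch of a pipeline that stops the crawl once it sees an item id it has already processed (the item_id field name and the in-memory set are illustrative assumptions):

class StopOnDuplicatePipeline(object):
    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        item_id = item['item_id']  # assumed field name
        if item_id in self.seen_ids:
            # Tell the engine to close the spider; requests already
            # in flight still finish, but no new ones are scheduled.
            spider.crawler.engine.close_spider(spider, reason='duplicate')
        else:
            self.seen_ids.add(item_id)
        return item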

Macbric answered Oct 09 '22