How to stop the reactor while several scrapy spiders are running in the same process

I have read from here and here, and managed to get multiple spiders running in the same process.

However, I don't know how to design a signal system to stop the reactor when all spiders are finished.

My code is quite similar to the following example:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

def setup_crawler(domain):
    spider = FollowAllSpider(domain=domain)
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

for domain in ['scrapinghub.com', 'insophia.com']:
    setup_crawler(domain)
log.start()
reactor.run()

After all the crawlers stop, the reactor is still running. If I add the statement

crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

to the setup_crawler function, the reactor stops when the first crawler closes.

Can anybody show me how to make the reactor stop when all the crawlers have finished?

asked Sep 13 '13 by user2776549
People also ask

How do you use multiple spiders in Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders in the same process simultaneously. We create an instance of CrawlerProcess with the project settings, and we create a Crawler instance per spider if we want custom settings for a particular spider.
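
As a minimal sketch using the newer CrawlerProcess API (SpiderOne and SpiderTwo are placeholder spider classes, not part of the question):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# Schedule several spiders; they crawl concurrently in the same process.
process.crawl(SpiderOne)
process.crawl(SpiderTwo)

# Blocks until every scheduled crawl has finished, then stops the reactor for us.
process.start()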

How does Scrapy spider work?

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
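
As a hypothetical minimal example of such a class (the spider name and URL are made up):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://example.com"]  # initial requests are generated from these URLs

    def parse(self, response):
        # extract data from the downloaded page content...
        yield {"title": response.css("title::text").get()}
        # ...and optionally follow links found in the page
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)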

How do you run a Scrapy spider from a Python script?

The key to running Scrapy from a Python script is the CrawlerProcess class, found in the scrapy.crawler module. It provides the engine to run Scrapy within a Python script; internally, CrawlerProcess imports and drives Python's Twisted framework.
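
As a sketch, assuming the FollowAllSpider from the question is importable:

from scrapy.crawler import CrawlerProcess
from testspiders.spiders.followall import FollowAllSpider

# CrawlerProcess wraps Twisted's reactor, so no explicit reactor.run()/reactor.stop() is needed.
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(FollowAllSpider, domain="scrapinghub.com")
process.start()  # blocks until the crawl finishes, then stops the reactor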


2 Answers

Further to shackra's answer, taking that route does work. You can create the signal receiver as a closure which retains state, meaning it keeps a record of the number of spiders that have completed. Your code should know how many spiders you are running, so it is a simple matter of checking when all have run and then calling reactor.stop().

e.g.

Link the signal receiver to your crawler:

crawler.signals.connect(spider_finished, signal=signals.spider_closed)

Create the signal receiver:

def spider_finished_count():
    spider_finished_count.count = 0

    def inc_count(spider, reason):
        spider_finished_count.count += 1
        if spider_finished_count.count == NUMBER_OF_SPIDERS:
            reactor.stop()
    return inc_count
spider_finished = spider_finished_count()

NUMBER_OF_SPIDERS being the total number of spiders you are running in this process.

Or you could do it the other way around and count down from the number of spiders running to 0. Or more complex solutions could involve keeping a record (e.g. a dict or set) of which spiders have and have not completed; a sketch of that variant follows.
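
A rough sketch of that bookkeeping variant, written against the same old-style Crawler API as the question (the imports mirror the question's code, plus Scrapy's signals module):

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from testspiders.spiders.followall import FollowAllSpider

running_spiders = set()

def spider_finished(spider, reason):
    running_spiders.discard(spider)   # this spider is done
    if not running_spiders:           # every spider we launched has closed
        reactor.stop()

def setup_crawler(domain):
    spider = FollowAllSpider(domain=domain)
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.signals.connect(spider_finished, signal=signals.spider_closed)
    crawler.crawl(spider)
    crawler.start()
    running_spiders.add(spider)       # remember every spider we launch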

NB: inc_count is passed spider and reason, which we do not use in this example, but you may wish to: they are sent by the signal dispatcher and are the spider which closed and the reason (a string) for it closing.

Scrapy version: v0.24.5

answered Oct 01 '22 by Darian Moody

What I usually do in PySide (I use QNetworkAccessManager and many self-created workers for scraping) is to maintain a counter of how many workers have finished processing work from the queue. When this counter reaches the number of created workers, a signal is triggered to indicate that there is no more work to do, and the application can do something else (like enabling an "export" button so the user can export their results to a file, etc.). Of course, this counter has to live inside a method and has to be updated when a signal is emitted by the crawler/spider/worker.
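
A framework-agnostic sketch of that counter idea (the names are illustrative; in PySide the callback would typically be a Qt signal/slot rather than a plain function):

class WorkDoneNotifier:
    """Calls on_all_done once finished() has been called total times."""
    def __init__(self, total, on_all_done):
        self.total = total
        self.done = 0
        self.on_all_done = on_all_done

    def finished(self):
        self.done += 1
        if self.done == self.total:
            self.on_all_done()  # e.g. reactor.stop, or enabling an "export" button

# connect notifier.finished to each worker's "finished"/"closed" signal
notifier = WorkDoneNotifier(total=2, on_all_done=lambda: print("all workers finished"))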

It might not be an elegant way of fixing your problem, but have you tried it anyway?

answered Oct 01 '22 by shackra