This is not working anymore; Scrapy's API has changed. The documentation now features a way to "Run Scrapy from a script", but I get the ReactorNotRestartable error.
My task:
from celery import Task
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings
from .spiders import MySpider

class MyTask(Task):
    def run(self, *args, **kwargs):
        spider = MySpider()
        settings = get_project_settings()
        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
        log.start()
        reactor.run()  # raises ReactorNotRestartable when the task runs a second time
The key to running Scrapy from a Python script is the CrawlerProcess class, found in the scrapy.crawler module. It provides the engine to run Scrapy within a Python script, and internally it imports Python's Twisted framework. Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor. The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess.
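For example, here is a minimal sketch following the current "Run Scrapy from a script" documentation (MySpider and its import path stand in for your own spider class):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from myproject.spiders import MySpider  # illustrative import path

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)  # schedule the spider to run
process.start()          # start the reactor; blocks until crawling is finished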
The Twisted reactor cannot be restarted. A workaround is to let the Celery task fork a new child process for each crawl you want to execute, as proposed in the following post:
This gets around the "reactor cannot be restarted" issue by utilizing the multiprocessing package. But the problem is that this workaround is now obsolete with the latest Celery version, because you run into another issue where a daemon process can't spawn subprocesses. So in order for the workaround to work, you need to go down in Celery version.
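To illustrate the limitation, here is a minimal sketch of that multiprocessing pattern (the function name is illustrative): inside a Celery worker, whose pool processes are daemonic, the start() call fails with "AssertionError: daemonic processes are not allowed to have children".

from multiprocessing import Process

def _crawl():
    # set up the Crawler and run the reactor here,
    # as in the UrlCrawlerScript class further below
    ...

p = Process(target=_crawl)
p.start()  # raises AssertionError inside a daemonic Celery worker process
p.join()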
Yes, and the Scrapy API has changed. But with minor modifications (importing Crawler instead of CrawlerProcess), you can get the workaround to work by going down in Celery version.
The Celery issue can be found here: Celery Issue #1709
Here is my updated crawl script that works with newer Celery versions by utilizing billiard instead of multiprocessing:
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from scrapy import signals
from twisted.internet import reactor
from billiard import Process  # billiard's Process can be forked from a daemonic worker
from myspider import MySpider

class UrlCrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(settings)
        self.crawler.configure()
        # stop the reactor once the spider closes, so the child process can exit
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        reactor.run()

def run_spider(url):
    spider = MySpider(url)
    crawler = UrlCrawlerScript(spider)
    crawler.start()  # fork a fresh child process; each one gets a fresh reactor
    crawler.join()
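For completeness, a hypothetical Celery task that wires up the script above (the app name, broker URL, and module path are illustrative, not from the original post):

from celery import Celery
from url_crawler import run_spider  # illustrative module containing the script above

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def crawl(url):
    run_spider(url)  # each call forks a fresh billiard child process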
Edit: By reading Celery issue #1709, they suggest using billiard instead of multiprocessing in order for the subprocess limitation to be lifted. In other words, we should try billiard and see if it works!
Edit 2: Yes, by using billiard, my script works with the latest Celery build! See my updated script.
The Twisted reactor cannot be restarted, so once one spider finishes running and crawler stops the reactor implicitly, that worker is useless.
As posted in the answers to that other question, all you need to do is kill the worker that ran your spider and replace it with a fresh one; this prevents the reactor from being started and stopped more than once in the same process. To do this, just set:
CELERYD_MAX_TASKS_PER_CHILD = 1
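This goes in your Celery configuration module (the file name is illustrative; note that newer Celery versions spell this setting worker_max_tasks_per_child):

# celeryconfig.py
CELERYD_MAX_TASKS_PER_CHILD = 1  # recycle each worker process after one task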
The downside is that you're not really using the Twisted reactor to its full potential, and you waste resources running multiple reactors, since one reactor can run multiple spiders at once in a single process. A better approach is to run one reactor per worker (or even one reactor globally) and not let crawler touch it.
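As a sketch of the "one reactor, many spiders" idea, the modern Scrapy API offers CrawlerRunner, which schedules crawls on a reactor you manage yourself (Spider1 and Spider2 stand in for your own spider classes):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()
runner.crawl(Spider1)  # both crawls run concurrently on the same reactor
runner.crawl(Spider2)
d = runner.join()                    # Deferred that fires when all crawls finish
d.addBoth(lambda _: reactor.stop())  # stop the reactor exactly once
reactor.run()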
I'm working on this for a very similar project, so I'll update this post if I make any progress.