
Running Scrapy spiders in a Celery task

I have a Django site where a scrape happens when a user requests it, and my code kicks off a standalone Scrapy spider script in a new process. Naturally, this doesn't scale as the number of users increases.

Something like this:

class StandAloneSpider(Spider):
    #a regular spider

settings.overrides['LOG_ENABLED'] = True
#more settings can be changed...

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = StandAloneSpider()

crawler.crawl(spider)
crawler.start()

I've decided to use Celery and use workers to queue up the crawl requests.

However, I'm running into issues with the Twisted reactor not being able to restart. The first and second spiders run successfully, but subsequent spiders throw the ReactorNotRestartable error.

Can anyone share any tips on running spiders within the Celery framework?

asked Jul 17 '12 by stryderjzw


People also ask

How do you run multiple spiders in a Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders in a process simultaneously. We need to create an instance of CrawlerProcess with the project settings. We need to create an instance of Crawler for the spider if we want to have custom settings for the Spider.
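For illustration, here is a minimal sketch of that pattern against the current Scrapy API; the two spider classes and the quotes.toscrape.com URLs are placeholders, not part of this question:

from scrapy import Spider
from scrapy.crawler import CrawlerProcess

class FirstSpider(Spider):              # hypothetical example spider
    name = "first"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

class SecondSpider(Spider):             # second hypothetical spider
    name = "second"
    start_urls = ["http://quotes.toscrape.com/page/2/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

process = CrawlerProcess(settings={"LOG_ENABLED": True})
process.crawl(FirstSpider)              # queue both crawls first...
process.crawl(SecondSpider)
process.start()                         # ...then start the reactor once; blocks until both finish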

How do you run a Scrapy spider from a Python script?

The key to running Scrapy from a Python script is the CrawlerProcess class, which lives in the scrapy.crawler module. It provides the engine that runs Scrapy within a Python script; internally, CrawlerProcess imports and drives Python's Twisted framework.
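A minimal single-spider version of that, assuming the script lives inside an existing Scrapy project so get_project_settings() can find the project configuration (the import path myproject.spiders.domain is an assumption):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.domain import DomainSpider   # hypothetical project layout

process = CrawlerProcess(get_project_settings())
process.crawl(DomainSpider)   # pass the spider class; keyword args are forwarded to __init__
process.start()               # starts the Twisted reactor and blocks until the crawl ends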

How does celery execute tasks?

Celery workers are worker processes that run tasks independently from one another and outside the context of your main service. Celery beat is a scheduler that orchestrates when to run tasks. You can use it to schedule periodic tasks as well.
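As a hedged sketch of how those two pieces fit together (the module name "proj" and the Redis broker URL are assumptions, not from this page):

from celery import Celery
from celery.schedules import crontab

app = Celery("proj", broker="redis://localhost:6379/0")   # broker URL is an assumption

@app.task
def refresh_crawl_queue():
    # placeholder body; a worker process executes this, not the web service
    print("checking for domains that need crawling")

# Celery beat reads this schedule and enqueues the task; workers pick it up.
app.conf.beat_schedule = {
    "refresh-every-hour": {
        "task": "proj.refresh_crawl_queue",   # assumes this module is named proj.py
        "schedule": crontab(minute=0),        # top of every hour
    },
}

A worker and a beat process are then started separately, e.g. "celery -A proj worker" and "celery -A proj beat".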

How do I add a task to celery?

All tasks must be imported during Django and Celery startup so that Celery knows about them. If we put them in <appname>/tasks.py files and call app.autodiscover_tasks(), that will do it. Or we could put our tasks in our models files, or import them from there, or import them from application ready methods.
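For example, the usual Django wiring is a celery.py next to settings.py plus per-app tasks.py files; the project name "mysite" and app name "crawler" below are assumptions:

# mysite/celery.py  (project name is an assumption)
import os
from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

app = Celery("mysite")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()   # picks up <appname>/tasks.py in each installed app

# crawler/tasks.py  (app name is an assumption)
from celery import shared_task

@shared_task
def ping():
    return "pong"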


1 Answer

Okay, here is how I got Scrapy working with my Django project, which uses Celery to queue up what to crawl. The actual workaround came primarily from joehillen's code, located here: http://snippets.scrapy.org/snippets/13/

First the tasks.py file

from celery import task

@task()
def crawl_domain(domain_pk):
    from crawl import domain_crawl
    return domain_crawl(domain_pk)

Then the crawl.py file

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings

from spider import DomainSpider
from models import Domain


class DomainCrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        self.crawler.install()
        self.crawler.configure()

    def _crawl(self, domain_pk):
        domain = Domain.objects.get(
            pk=domain_pk,
        )
        urls = []
        for page in domain.pages.all():
            urls.append(page.url())
        self.crawler.crawl(DomainSpider(urls))
        self.crawler.start()
        self.crawler.stop()

    def crawl(self, domain_pk):
        p = Process(target=self._crawl, args=[domain_pk])
        p.start()
        p.join()


crawler = DomainCrawlerScript()


def domain_crawl(domain_pk):
    crawler.crawl(domain_pk)

The trick here is the "from multiprocessing import Process"; running each crawl in a child process gets around the "ReactorNotRestartable" issue in the Twisted framework. So basically, the Celery task calls the "domain_crawl" function, which reuses the "DomainCrawlerScript" object over and over to interface with your Scrapy spider. (I am aware that my example is a little redundant, but I did do this for a reason in my setup with multiple versions of Python [my Django webserver is actually using Python 2.4 and my worker servers use Python 2.7].)
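For completeness, here is a hedged sketch of how the Django side might enqueue the task; this view is not part of the answer and its names are assumptions, but it shows the handoff from the web request to the Celery worker:

# Hypothetical Django view that queues a crawl instead of running it inline.
from django.http import HttpResponse
from django.shortcuts import get_object_or_404

from models import Domain        # mirrors the flat import style used in the answer
from tasks import crawl_domain

def request_crawl(request, domain_pk):
    domain = get_object_or_404(Domain, pk=domain_pk)
    crawl_domain.delay(domain.pk)     # hand the work to a Celery worker and return immediately
    return HttpResponse("Crawl queued for domain %s" % domain.pk)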

In my example here, "DomainSpider" is just a modified Scrapy Spider that takes in a list of URLs and sets them as its "start_urls".
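The answer doesn't include that spider, but a minimal version, written against a more recent Scrapy API and with a made-up parse() body, might look like this:

from scrapy import Spider

class DomainSpider(Spider):
    name = "domain_spider"

    def __init__(self, urls, *args, **kwargs):
        super(DomainSpider, self).__init__(*args, **kwargs)
        self.start_urls = urls            # the list of page URLs passed in by the crawler script

    def parse(self, response):
        # placeholder: just record which pages were fetched
        yield {"url": response.url, "status": response.status}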

Hope this helps!

answered Sep 25 '22 by byoungb