
Running Scrapy spiders in a Celery task

I have a Django site where a scrape happens when a user requests it, and my code kicks off a standalone Scrapy spider script in a new process. Naturally, this doesn't scale as the number of users increases.

Something like this:

class StandAloneSpider(Spider):
    #a regular spider

settings.overrides['LOG_ENABLED'] = True
#more settings can be changed...

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = StandAloneSpider()

crawler.crawl(spider)
crawler.start()

I've decided to use Celery and use workers to queue up the crawl requests.

However, I'm running into issues with the Twisted reactor not being able to restart. The first and second spiders run successfully, but subsequent spiders throw the ReactorNotRestartable error.

Can anyone share any tips on running spiders within the Celery framework?

asked Jul 17 '12 by stryderjzw


People also ask

How do you run multiple spiders in a Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders in a process simultaneously. We need to create an instance of CrawlerProcess with the project settings. We need to create an instance of Crawler for the spider if we want to have custom settings for the Spider.
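For illustration, here is a minimal sketch of that pattern against the current Scrapy API; the two spider classes and the quotes.toscrape.com URLs are placeholders, not part of this question:

from scrapy import Spider
from scrapy.crawler import CrawlerProcess

class FirstSpider(Spider):              # hypothetical example spider
    name = "first"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

class SecondSpider(Spider):             # second hypothetical spider
    name = "second"
    start_urls = ["http://quotes.toscrape.com/page/2/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

process = CrawlerProcess(settings={"LOG_ENABLED": True})
process.crawl(FirstSpider)              # queue both crawls first...
process.crawl(SecondSpider)
process.start()                         # ...then start the reactor once; blocks until both finish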

How do you run a Scrapy spider from a Python script?

The key to running Scrapy from a Python script is the CrawlerProcess class, which lives in the scrapy.crawler module. It provides the engine that runs Scrapy within a Python script; internally, CrawlerProcess imports and drives Python's Twisted framework.
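A minimal single-spider version of that, assuming the script lives inside an existing Scrapy project so get_project_settings() can find the project configuration (the import path myproject.spiders.domain is an assumption):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.domain import DomainSpider   # hypothetical project layout

process = CrawlerProcess(get_project_settings())
process.crawl(DomainSpider)   # pass the spider class; keyword args are forwarded to __init__
process.start()               # starts the Twisted reactor and blocks until the crawl ends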

How does celery execute tasks?

Celery workers are worker processes that run tasks independently from one another and outside the context of your main service. Celery beat is a scheduler that orchestrates when to run tasks. You can use it to schedule periodic tasks as well.
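As a hedged sketch of how those two pieces fit together (the module name "proj" and the Redis broker URL are assumptions, not from this page):

from celery import Celery
from celery.schedules import crontab

app = Celery("proj", broker="redis://localhost:6379/0")   # broker URL is an assumption

@app.task
def refresh_crawl_queue():
    # placeholder body; a worker process executes this, not the web service
    print("checking for domains that need crawling")

# Celery beat reads this schedule and enqueues the task; workers pick it up.
app.conf.beat_schedule = {
    "refresh-every-hour": {
        "task": "proj.refresh_crawl_queue",   # assumes this module is named proj.py
        "schedule": crontab(minute=0),        # top of every hour
    },
}

A worker and a beat process are then started separately, e.g. "celery -A proj worker" and "celery -A proj beat".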

How do I add a task to celery?

All tasks must be imported during Django and Celery startup so that Celery knows about them. If we put them in <appname>/tasks.py files and call app.autodiscover_tasks(), that will do it. Or we could put our tasks in our models files, or import them from there, or import them from application ready methods.
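For example, the usual Django wiring is a celery.py next to settings.py plus per-app tasks.py files; the project name "mysite" and app name "crawler" below are assumptions:

# mysite/celery.py  (project name is an assumption)
import os
from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

app = Celery("mysite")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()   # picks up <appname>/tasks.py in each installed app

# crawler/tasks.py  (app name is an assumption)
from celery import shared_task

@shared_task
def ping():
    return "pong"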


1 Answer

Okay, here is how I got Scrapy working with my Django project, which uses Celery to queue up what to crawl. The actual workaround came primarily from joehillen's code, located here: http://snippets.scrapy.org/snippets/13/

First the tasks.py file

from celery import task

@task()
def crawl_domain(domain_pk):
    from crawl import domain_crawl
    return domain_crawl(domain_pk)

Then the crawl.py file

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings

from spider import DomainSpider
from models import Domain


class DomainCrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        self.crawler.install()
        self.crawler.configure()

    def _crawl(self, domain_pk):
        domain = Domain.objects.get(
            pk=domain_pk,
        )
        urls = []
        for page in domain.pages.all():
            urls.append(page.url())
        self.crawler.crawl(DomainSpider(urls))
        self.crawler.start()
        self.crawler.stop()

    def crawl(self, domain_pk):
        p = Process(target=self._crawl, args=[domain_pk])
        p.start()
        p.join()


crawler = DomainCrawlerScript()


def domain_crawl(domain_pk):
    crawler.crawl(domain_pk)

The trick here is the "from multiprocessing import Process"; running each crawl in a child process gets around the "ReactorNotRestartable" issue in the Twisted framework. So basically, the Celery task calls the "domain_crawl" function, which reuses the "DomainCrawlerScript" object over and over to interface with your Scrapy spider. (I am aware that my example is a little redundant, but I did do this for a reason in my setup with multiple versions of Python [my Django webserver is actually using Python 2.4 and my worker servers use Python 2.7].)
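For completeness, here is a hedged sketch of how the Django side might enqueue the task; this view is not part of the answer and its names are assumptions, but it shows the handoff from the web request to the Celery worker:

# Hypothetical Django view that queues a crawl instead of running it inline.
from django.http import HttpResponse
from django.shortcuts import get_object_or_404

from models import Domain        # mirrors the flat import style used in the answer
from tasks import crawl_domain

def request_crawl(request, domain_pk):
    domain = get_object_or_404(Domain, pk=domain_pk)
    crawl_domain.delay(domain.pk)     # hand the work to a Celery worker and return immediately
    return HttpResponse("Crawl queued for domain %s" % domain.pk)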

In my example here, "DomainSpider" is just a modified Scrapy Spider that takes in a list of URLs and sets them as its "start_urls".
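The answer doesn't include that spider, but a minimal version, written against a more recent Scrapy API and with a made-up parse() body, might look like this:

from scrapy import Spider

class DomainSpider(Spider):
    name = "domain_spider"

    def __init__(self, urls, *args, **kwargs):
        super(DomainSpider, self).__init__(*args, **kwargs)
        self.start_urls = urls            # the list of page URLs passed in by the crawler script

    def parse(self, response):
        # placeholder: just record which pages were fetched
        yield {"url": response.url, "status": response.status}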

Hope this helps!

answered Sep 25 '22 by byoungb