
CrawlerProcess vs CrawlerRunner

The Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:

  • using CrawlerProcess
  • using CrawlerRunner

What is the difference between the two? When should I use "process" and when "runner"?

asked Sep 26 '16 by alecxe



1 Answer

Scrapy's documentation does a pretty bad job of giving examples of real applications of both.

CrawlerProcess assumes that Scrapy is the only thing that is going to use Twisted's reactor. If you are using threads in Python to run other code, that isn't always true. Take this example:

```python
from scrapy.crawler import CrawlerProcess
import scrapy

def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
notThreadSafe(3)  # it will only get executed when the crawlers stop
```

Now, as you can see, notThreadSafe only gets executed once the crawlers stop. What if I want the function to be executed while the crawlers crawl, in the same reactor?

```python
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import scrapy

def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.callFromThread(notThreadSafe, 3)
reactor.run()  # it will run both crawlers and the code inside the function
```

The Runner class is not limited to this functionality; you may also want custom behaviour on your reactor (deferreds, threads, getPage, custom error reporting, etc.).

answered Sep 16 '22 by Rafael Almeida