The Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:
CrawlerProcess
CrawlerRunner
What is the difference between the two? When should I use "process" and when "runner"?
CrawlerProcess: this class will start a Twisted reactor for you, configuring the logging and setting shutdown handlers. It is the class used by all Scrapy commands. Here's an example showing how to run a single spider with it:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
The CrawlerProcess class can also run multiple Scrapy spiders in the same process simultaneously. Create the CrawlerProcess instance with the project settings; if a spider needs its own custom settings, wrap it in a Crawler instance and schedule that instead.
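As a minimal sketch of that (the spider names and the settings value are hypothetical, and the script is assumed to live inside a Scrapy project so that get_project_settings() can find settings.py):

    import scrapy
    from scrapy.crawler import Crawler, CrawlerProcess
    from scrapy.utils.project import get_project_settings

    class SpiderOne(scrapy.Spider):
        name = 'one'

    class SpiderTwo(scrapy.Spider):
        name = 'two'

    process = CrawlerProcess(get_project_settings())
    process.crawl(SpiderOne)  # runs with the project settings

    # Wrap the second spider in its own Crawler to give it custom settings
    crawler = Crawler(SpiderTwo, {'DOWNLOAD_DELAY': 2.0})
    process.crawl(crawler)

    process.start()  # blocks until both spiders finish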
The key to running Scrapy from a Python script is the CrawlerProcess class, which lives in the scrapy.crawler module and provides the engine to run Scrapy within a script. Under the hood, CrawlerProcess is built on Python's Twisted framework.
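One practical consequence of that, sketched below: process.start() runs the global Twisted reactor, and Twisted reactors cannot be restarted, so a CrawlerProcess can only be started once per Python process (the spider here is a hypothetical minimal one):

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        name = 'example'
        start_urls = ['https://example.com']

        def parse(self, response):
            pass

    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()  # starts the Twisted reactor; returns once crawling is done
    # A second process.start() here would raise
    # twisted.internet.error.ReactorNotRestartable.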
Scrapy's documentation does a pretty bad job of giving examples of real-world applications of both.
CrawlerProcess assumes that Scrapy is the only thing that is going to use Twisted's reactor. If you are using threads in Python to run other code, this isn't always true. Let's take this as an example:
    from scrapy.crawler import CrawlerProcess
    import scrapy

    def notThreadSafe(x):
        """do something that isn't thread-safe"""
        # ...

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # the script will block here until all crawling jobs are finished
    notThreadSafe(3)  # it will get executed when the crawlers stop
Now, as you can see, the function will only get executed when the crawlers stop. What if I want the function to be executed while the crawlers crawl, in the same reactor?
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    import scrapy

    def notThreadSafe(x):
        """do something that isn't thread-safe"""
        # ...

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.callFromThread(notThreadSafe, 3)
    reactor.run()  # it will run both crawlers and code inside the function
The CrawlerRunner class is not limited to this functionality; you may want custom control over your reactor (deferreds, threads, getPage, custom error reporting, etc.).
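For instance, a deferred-based pattern close to the one in the Scrapy docs runs spiders sequentially instead of in parallel. Note that, unlike CrawlerProcess, CrawlerRunner does not configure logging for you, hence the explicit configure_logging() call:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()  # CrawlerRunner does not set up logging on its own
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        # MySpider1 and MySpider2 as defined in the examples above
        yield runner.crawl(MySpider1)  # wait until the first spider finishes
        yield runner.crawl(MySpider2)  # then start the second one
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until crawl() stops the reactor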