The Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:
CrawlerProcess
CrawlerRunner
What is the difference between the two? When should I use "process" and when "runner"?
CrawlerProcess: this class will start a Twisted reactor for you, configuring the logging and setting shutdown handlers. It is the class used by all Scrapy commands. Here's an example showing how to run a single spider with it:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
The CrawlerProcess class can also run multiple Scrapy spiders in the same process simultaneously. Create the CrawlerProcess instance with the project settings; if a spider needs its own custom settings, wrap it in a Crawler instance and schedule that instead.
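As a minimal sketch of that (the spider names and the settings value are hypothetical, and the script is assumed to live inside a Scrapy project so that get_project_settings() can find settings.py):

    import scrapy
    from scrapy.crawler import Crawler, CrawlerProcess
    from scrapy.utils.project import get_project_settings

    class SpiderOne(scrapy.Spider):
        name = 'one'

    class SpiderTwo(scrapy.Spider):
        name = 'two'

    process = CrawlerProcess(get_project_settings())
    process.crawl(SpiderOne)  # runs with the project settings

    # Wrap the second spider in its own Crawler to give it custom settings
    crawler = Crawler(SpiderTwo, {'DOWNLOAD_DELAY': 2.0})
    process.crawl(crawler)

    process.start()  # blocks until both spiders finish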
The key to running Scrapy from a Python script is the CrawlerProcess class, which lives in the scrapy.crawler module and provides the engine to run Scrapy within a script. Under the hood, CrawlerProcess is built on Python's Twisted framework.
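One practical consequence of that, sketched below: process.start() runs the global Twisted reactor, and Twisted reactors cannot be restarted, so a CrawlerProcess can only be started once per Python process (the spider here is a hypothetical minimal one):

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        name = 'example'
        start_urls = ['https://example.com']

        def parse(self, response):
            pass

    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()  # starts the Twisted reactor; returns once crawling is done
    # A second process.start() here would raise
    # twisted.internet.error.ReactorNotRestartable.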
Scrapy's documentation does a pretty bad job of giving examples of real-world applications of both.
CrawlerProcess assumes that Scrapy is the only thing that is going to use Twisted's reactor. If you are using threads in Python to run other code, this isn't always true. Let's take this as an example:
    from scrapy.crawler import CrawlerProcess
    import scrapy

    def notThreadSafe(x):
        """do something that isn't thread-safe"""
        # ...

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # the script will block here until all crawling jobs are finished
    notThreadSafe(3)  # it will get executed when the crawlers stop
Now, as you can see, the function will only get executed when the crawlers stop. What if I want the function to be executed while the crawlers crawl, in the same reactor?
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    import scrapy

    def notThreadSafe(x):
        """do something that isn't thread-safe"""
        # ...

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.callFromThread(notThreadSafe, 3)
    reactor.run()  # it will run both crawlers and code inside the function
The CrawlerRunner class is not limited to this functionality; you may want custom control over your reactor (deferreds, threads, getPage, custom error reporting, etc.).
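For instance, a deferred-based pattern close to the one in the Scrapy docs runs spiders sequentially instead of in parallel. Note that, unlike CrawlerProcess, CrawlerRunner does not configure logging for you, hence the explicit configure_logging() call:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()  # CrawlerRunner does not set up logging on its own
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        # MySpider1 and MySpider2 as defined in the examples above
        yield runner.crawl(MySpider1)  # wait until the first spider finishes
        yield runner.crawl(MySpider2)  # then start the second one
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until crawl() stops the reactor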