I have a script called algorithm.py and I want to be able to call Scrapy spiders from within that script. The file structure is:
algorithm.py
MySpiders/
where MySpiders is a folder containing several Scrapy projects. I would like to create methods perform_spider1(), perform_spider2(), ... which I can call from algorithm.py.
How do I construct these methods?
I have managed to call one spider using the following code; however, it isn't wrapped in a method and it only works for one spider. I'm a beginner in need of help!
import sys, os.path
sys.path.append('path to spider1/spider1')

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher

from spider1.spiders.spider1_spider import Spider1Spider

def stop_reactor():
    reactor.stop()

# Stop the reactor as soon as the spider has finished
dispatcher.connect(stop_reactor, signal=signals.spider_closed)

spider = Spider1Spider()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()

log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here
log.msg('Reactor stopped.')
Basic script: the key to running Scrapy from a Python script is the CrawlerProcess class from the scrapy.crawler module. It provides the engine to run Scrapy inside a Python script; under the hood it uses Python's Twisted framework. CrawlerProcess can run multiple Scrapy spiders in a single process: you create an instance of CrawlerProcess with the project settings, and you only need a separate Crawler instance for a spider that requires custom settings. CrawlerProcess starts a Twisted reactor for you, configures logging and sets shutdown handlers; it is the class used by all Scrapy commands. Here's an example showing how to run a single spider with it.
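(A minimal sketch on a recent Scrapy version; the spider name, start URL and settings below are placeholders, not taken from the question.)

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'my_spider'                    # placeholder spider name
    start_urls = ['https://example.com']  # placeholder start URL

    def parse(self, response):
        # Yield the page title as a trivial example item
        yield {'title': response.css('title::text').get()}

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(MySpider)   # call process.crawl() once per spider to run several
process.start()           # the script will block here until crawling is finished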
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
Just go through your spiders and set them up via calling configure(), crawl() and start(), and only then call log.start() and reactor.run(). Scrapy will run multiple spiders in the same process.
For more info see documentation and this thread.
Also, consider running your spiders via scrapyd.
Hope that helps.
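If you do go down the scrapyd route suggested above, here is a rough sketch of scheduling a spider over scrapyd's JSON HTTP API, assuming a scrapyd instance is running on its default port 6800; the project and spider names are placeholders:

import requests

# Ask scrapyd to schedule a crawl job; replace the names with your own project/spider
response = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'spider1', 'spider': 'spider1_spider'},
)
print(response.json())  # e.g. {'status': 'ok', 'jobid': '...'}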
Based on the good advice from alecxe, here is a possible solution.
import sys, os.path
sys.path.append('/path/ra_list/')
sys.path.append('/path/ra_event/')

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher

from ra_list.spiders.ra_list_spider import RaListSpider
from ra_event.spiders.ra_event_spider import RaEventSpider

spider_count = 0
number_of_spiders = 2

def stop_reactor_after_all_spiders():
    # Only stop the reactor once every spider has fired spider_closed
    global spider_count
    spider_count += 1
    if spider_count == number_of_spiders:
        reactor.stop()

dispatcher.connect(stop_reactor_after_all_spiders, signal=signals.spider_closed)

def crawl_resident_advisor():
    global spider_count
    spider_count = 0

    # Set up one Crawler per spider, then start the reactor once for both
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(RaListSpider())
    crawler.start()

    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(RaEventSpider())
    crawler.start()

    log.start()
    log.msg('Running in reactor...')
    reactor.run()  # the script will block here
    log.msg('Reactor stopped.')
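To get the separate perform_spider1() and perform_spider2() methods the question asked for, one option (a sketch, not taken from the answers above) is to launch each crawl as its own OS process via the regular scrapy crawl command, since a Twisted reactor cannot be restarted once it has finished; the spider names and project directories below are assumptions to replace with your own:

import subprocess

def perform_spider1():
    # Run the spider through the normal Scrapy CLI in a child process so the
    # Twisted reactor starts fresh on every call (it cannot be restarted in-process)
    subprocess.run(['scrapy', 'crawl', 'spider1'], cwd='MySpiders/spider1', check=True)

def perform_spider2():
    subprocess.run(['scrapy', 'crawl', 'spider2'], cwd='MySpiders/spider2', check=True)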