I want to start a crawler in Scrapy from a Python module. I want to essentially mimic the essence of $ scrapy crawl my_crawler -a some_arg=value -L DEBUG
I have the following things in place: a settings.py file for the project, and a crawler class with the my_crawler name attribute (I can instantiate this class easily from my test module).

I can quite happily run my project using the scrapy command as specified above. However, I'm writing integration tests, and I want to programmatically launch the crawl using the settings in settings.py and the crawler that has the my_crawler name attribute.

So, can anyone help me? I've seen some examples on the net, but they are either hacks for multiple spiders, or work around Twisted's blocking nature, or don't work with Scrapy 0.14 or above. I just need something really simple. :-)
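You can drive the crawler from a plain script and let Twisted's reactor block until the spider closes: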
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider

# Instantiate the spider directly; constructor kwargs play the role
# of the -a command-line arguments
spider = FollowAllSpider(domain='scrapinghub.com')

# Set up a crawler and stop the Twisted reactor once the spider finishes
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()

log.start()
reactor.run()  # the script will block here until the spider_closed signal was sent
See the "Run Scrapy from a script" part of the Scrapy docs.
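To mirror the -a some_arg=value and -L DEBUG parts of the original command, the same pattern can be extended. This is a minimal sketch assuming the same 0.x-era API; myproject.spiders.my_crawler and MySpider are hypothetical stand-ins for your own module and spider class:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings
from myproject.spiders.my_crawler import MySpider  # hypothetical import path

# -a some_arg=value becomes a keyword argument to the spider's constructor
spider = MySpider(some_arg='value')

# get_project_settings() loads the project's settings.py,
# just as the scrapy command does
crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()

# -L DEBUG maps to the loglevel argument of log.start()
log.start(loglevel=log.DEBUG)
reactor.run()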