What is the simplest way to programmatically start a crawler in Scrapy >= 0.14

I want to start a crawler in Scrapy from a Python module. Essentially, I want to mimic $ scrapy crawl my_crawler -a some_arg=value -L DEBUG

I have the following things in place:

  • a settings.py file for the project
  • items and pipelines
  • a crawler class which extends BaseSpider and requires arguments upon initialisation.

I can quite happily run my project using the scrapy command as specified above, however I'm writing integration tests and I want to programmatically:

  • launch the crawl using the settings in settings.py and the spider whose name attribute is my_crawler (I can instantiate this class easily from my test module),
  • use all the pipelines and middleware as specified in settings.py,
  • and block until the crawler has finished. The pipelines dump things into a DB, and it's the contents of the DB I'll inspect after the crawl to satisfy my tests (roughly the test shape sketched below).
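
To make the goal concrete, here is roughly the shape of test I'm after. run_my_crawler and fetch_rows_from_db are placeholder names for helpers I'd still need to write, not real APIs:

def test_crawl_populates_db():
    run_my_crawler(some_arg='value')  # should block until the crawl finishes
    rows = fetch_rows_from_db()       # the pipelines write here during the crawl
    assert len(rows) > 0              # inspect whatever the pipelines stored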

So, can anyone help me? I've seen some examples on the net, but they are either hacks for multiple spiders, or workarounds for Twisted's blocking nature, or they don't work with Scrapy 0.14 or above. I just need something really simple. :-)

asked Jun 26 '12 by Edwardr
1 Answer

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())  # note: Settings() gives the default settings, not your project's settings.py
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)  # stop the reactor when the spider closes
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # the script will block here until the spider_closed signal is sent

See this part of the docs
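
Note that Settings() in the snippet above gives Scrapy's default settings, not your project's settings.py. Below is a minimal sketch adapted to the question's setup, assuming a Scrapy version around 0.16 (where scrapy.utils.project.get_project_settings is available), that your project's settings module is importable (or SCRAPY_SETTINGS_MODULE is set), and that the import path and MyCrawler name below are placeholders for your spider with name = 'my_crawler':

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from myproject.spiders.my_crawler import MyCrawler  # hypothetical import path

def run_my_crawler(**spider_args):
    # Programmatic equivalent of:
    #   scrapy crawl my_crawler -a some_arg=value -L DEBUG
    spider = MyCrawler(**spider_args)  # same as -a some_arg=value
    settings = get_project_settings()  # loads your project's settings.py
    crawler = Crawler(settings)        # so your pipelines and middleware apply
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start(loglevel=log.DEBUG)      # same as -L DEBUG
    reactor.run()                      # blocks until the spider closes

One caveat for integration tests: a Twisted reactor can only be started once per process, so calling run_my_crawler twice in the same test run will fail. Either run each crawl in its own subprocess, or do all your DB assertions after a single crawl.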

answered Oct 15 '22 by Wilfred Hughes