What is the simplest way to programmatically start a crawler in Scrapy >= 0.14

I want to start a crawler in Scrapy from a Python module. Essentially, I want to mimic $ scrapy crawl my_crawler -a some_arg=value -L DEBUG

I have the following things in place:

  • a settings.py file for the project
  • items and pipelines
  • a crawler class which extends BaseSpider and requires arguments upon initialisation.

I can quite happily run my project using the scrapy command as specified above, however I'm writing integration tests and I want to programmatically:

  • launch the crawl using the settings in settings.py and the spider whose name attribute is my_crawler (I can instantiate this class easily from my test module),
  • use all the pipelines and middleware as specified in settings.py,
  • and block until the crawler has finished. The pipelines dump things into a DB, and it's the contents of the DB I'll inspect after the crawl to satisfy my tests (roughly the test shape sketched below).
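
To make the goal concrete, here is roughly the shape of test I'm after. run_my_crawler and fetch_rows_from_db are placeholder names for helpers I'd still need to write, not real APIs:

def test_crawl_populates_db():
    run_my_crawler(some_arg='value')  # should block until the crawl finishes
    rows = fetch_rows_from_db()       # the pipelines write here during the crawl
    assert len(rows) > 0              # inspect whatever the pipelines stored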

So, can anyone help me? I've seen some examples on the net, but they are either hacks for multiple spiders, or workarounds for Twisted's blocking nature, or they don't work with Scrapy 0.14 or above. I just need something really simple. :-)

asked Jun 26 '12 by Edwardr
1 Answer

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())  # note: Settings() gives the default settings, not your project's settings.py
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)  # stop the reactor when the spider closes
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # the script will block here until the spider_closed signal is sent

See this part of the docs
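
Note that Settings() in the snippet above gives Scrapy's default settings, not your project's settings.py. Below is a minimal sketch adapted to the question's setup, assuming a Scrapy version around 0.16 (where scrapy.utils.project.get_project_settings is available), that your project's settings module is importable (or SCRAPY_SETTINGS_MODULE is set), and that the import path and MyCrawler name below are placeholders for your spider with name = 'my_crawler':

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from myproject.spiders.my_crawler import MyCrawler  # hypothetical import path

def run_my_crawler(**spider_args):
    # Programmatic equivalent of:
    #   scrapy crawl my_crawler -a some_arg=value -L DEBUG
    spider = MyCrawler(**spider_args)  # same as -a some_arg=value
    settings = get_project_settings()  # loads your project's settings.py
    crawler = Crawler(settings)        # so your pipelines and middleware apply
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start(loglevel=log.DEBUG)      # same as -L DEBUG
    reactor.run()                      # blocks until the spider closes

One caveat for integration tests: a Twisted reactor can only be started once per process, so calling run_my_crawler twice in the same test run will fail. Either run each crawl in its own subprocess, or do all your DB assertions after a single crawl.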

answered Oct 15 '22 by Wilfred Hughes