I'm trying to run Scrapy from a script as discussed here. It suggested using this snippet, but when I run it, it hangs indefinitely. This was written back in version 0.10; is it still compatible with the current stable release?
The key to running Scrapy from a Python script is the CrawlerProcess class, which lives in the scrapy.crawler module. It provides the engine to run Scrapy within a Python script, and internally it imports Python's Twisted framework.
We use the CrawlerProcess class to run multiple Scrapy spiders simultaneously in a single process. We create an instance of CrawlerProcess with the project settings; if a spider needs its own custom settings, we create a separate Crawler instance for it.
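Note that on a current stable release the module layout and entry points have changed, so the old snippet won't run as-is. A minimal sketch of the same idea against the modern API (assuming your spider class is MySpider) would be:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # or pass a plain settings dict
process.crawl(MySpider)  # schedule the spider; call again to run several at once
process.start()          # starts the Twisted reactor and blocks until crawling finishes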
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
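For reference, a minimal spider against the current API might look like the sketch below (quotes.toscrape.com is the site the official tutorial uses); the original 0.1x-era script follows after it.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # extract structured data (the "items")
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get(),
                   'author': quote.css('small.author::text').get()}
        # perform the crawl (follow links)
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)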
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings
from scrapy.http import Request
from scrapy.spider import BaseSpider  # needed for the MySpider class below
def handleSpiderIdle(spider):
    '''Handle spider idle event.'''  # http://doc.scrapy.org/topics/signals.html#spider-idle
    print '\nSpider idle: %s. Restarting it... ' % spider.name
    for url in spider.start_urls:  # reschedule the start urls
        spider.crawler.engine.crawl(Request(url, dont_filter=True), spider)
mySettings = {'LOG_ENABLED': True,
              'ITEM_PIPELINES': ['mybot.pipeline.validate.ValidateMyItem']}  # global settings http://doc.scrapy.org/topics/settings.html
settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()
class MySpider(BaseSpider):
    name = 'my_spider'  # every spider needs a unique name
    start_urls = ['http://site_to_scrape']

    def parse(self, response):
        # ... build your item from the response here ...
        yield item  # placeholder: yield a populated Item instance
spider = MySpider() # create a spider ourselves
crawlerProcess.queue.append_spider(spider) # add it to spiders pool
dispatcher.connect(handleSpiderIdle, signals.spider_idle) # use this if you need to handle idle event (restart spider?)
log.start() # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."
UPDATE:
If you also need per-spider settings, see this example:
for spiderConfig in spiderConfigs:
    spiderConfig = spiderConfig.copy()  # a dictionary similar to the one with global settings above
    spiderName = spiderConfig.pop('name')  # the spider's name comes from the config - the same spider class can be reused in several instances under different names
    spiderModuleName = spiderConfig.pop('spiderClass')  # the module containing the spider is also in the config
    spiderModule = __import__(spiderModuleName, {}, {}, [''])  # import that module
    SpiderClass = spiderModule.Spider  # the spider class is named 'Spider'
    spider = SpiderClass(name=spiderName, **spiderConfig)  # create the spider with its particular settings
    crawlerProcess.queue.append_spider(spider)  # add the spider to the pool
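On current releases you would not need the queue at all: per-spider settings live on the spider class as custom_settings, and per-instance arguments are passed through process.crawl(). A sketch (the category argument is just an illustration):

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class PlunderhereSpider(scrapy.Spider):
    name = 'plunderhere_com'
    custom_settings = {'DOWNLOAD_DELAY': 2}  # overrides project settings for this spider only

    def __init__(self, category=None, *args, **kwargs):
        super(PlunderhereSpider, self).__init__(*args, **kwargs)
        self.category = category  # hypothetical per-instance argument

process = CrawlerProcess(get_project_settings())
process.crawl(PlunderhereSpider, category='books')  # extra kwargs reach __init__
process.start()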
Example of a settings file for one such spider:
name = plunderhere_com
allowed_domains = plunderhere.com
spiderClass = scraper.spiders.plunderhere_com
start_urls = http://www.plunderhere.com/categories.php?
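The answer doesn't show how these files end up in spiderConfigs; a hypothetical loader for this simple key = value format might look like:

def load_spider_config(path):
    '''Parse a file with one "key = value" pair per line into a config dict.'''
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            key, _, value = line.partition('=')
            config[key.strip()] = value.strip()
    # Scrapy expects lists for these fields, so split on whitespace
    for key in ('start_urls', 'allowed_domains'):
        if key in config:
            config[key] = config[key].split()
    return config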