I'm trying to run Scrapy from a script as discussed here. It suggested using this snippet, but when I run it, it hangs indefinitely. This was written back in version 0.10; is it still compatible with the current stable release?
The key to running Scrapy from a Python script is the CrawlerProcess class, which lives in the scrapy.crawler module. It provides the engine to run Scrapy within a Python script, and internally it imports Python's Twisted framework.
We use the CrawlerProcess class to run multiple Scrapy spiders simultaneously in a single process. We create an instance of CrawlerProcess with the project settings; if a spider needs its own custom settings, we create a separate Crawler instance for it.
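Note that on a current stable release the module layout and entry points have changed, so the old snippet won't run as-is. A minimal sketch of the same idea against the modern API (assuming your spider class is MySpider) would be:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # or pass a plain settings dict
process.crawl(MySpider)  # schedule the spider; call again to run several at once
process.start()          # starts the Twisted reactor and blocks until crawling finishes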
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
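For reference, a minimal spider against the current API might look like the sketch below (quotes.toscrape.com is the site the official tutorial uses); the original 0.1x-era script follows after it.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # extract structured data (the "items")
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get(),
                   'author': quote.css('small.author::text').get()}
        # perform the crawl (follow links)
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)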
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings
from scrapy.http import Request
from scrapy.spider import BaseSpider  # needed for the MySpider class below
def handleSpiderIdle(spider):
    '''Handle spider idle event.'''  # http://doc.scrapy.org/topics/signals.html#spider-idle
    print '\nSpider idle: %s. Restarting it... ' % spider.name
    for url in spider.start_urls:  # reschedule the start urls
        spider.crawler.engine.crawl(Request(url, dont_filter=True), spider)
mySettings = {'LOG_ENABLED': True,
              'ITEM_PIPELINES': ['mybot.pipeline.validate.ValidateMyItem']}  # global settings http://doc.scrapy.org/topics/settings.html
settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()
class MySpider(BaseSpider):
    name = 'my_spider'  # every spider needs a unique name
    start_urls = ['http://site_to_scrape']

    def parse(self, response):
        # ... build your item from the response here ...
        yield item  # placeholder: yield a populated Item instance
spider = MySpider() # create a spider ourselves
crawlerProcess.queue.append_spider(spider) # add it to spiders pool
dispatcher.connect(handleSpiderIdle, signals.spider_idle) # use this if you need to handle idle event (restart spider?)
log.start() # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."
UPDATE:
If you also need per-spider settings, see this example:
for spiderConfig in spiderConfigs:
    spiderConfig = spiderConfig.copy()  # a dictionary similar to the one with global settings above
    spiderName = spiderConfig.pop('name')  # the spider's name comes from the config - the same spider class can be reused in several instances under different names
    spiderModuleName = spiderConfig.pop('spiderClass')  # the module containing the spider is also in the config
    spiderModule = __import__(spiderModuleName, {}, {}, [''])  # import that module
    SpiderClass = spiderModule.Spider  # the spider class is named 'Spider'
    spider = SpiderClass(name=spiderName, **spiderConfig)  # create the spider with its particular settings
    crawlerProcess.queue.append_spider(spider)  # add the spider to the pool
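On current releases you would not need the queue at all: per-spider settings live on the spider class as custom_settings, and per-instance arguments are passed through process.crawl(). A sketch (the category argument is just an illustration):

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class PlunderhereSpider(scrapy.Spider):
    name = 'plunderhere_com'
    custom_settings = {'DOWNLOAD_DELAY': 2}  # overrides project settings for this spider only

    def __init__(self, category=None, *args, **kwargs):
        super(PlunderhereSpider, self).__init__(*args, **kwargs)
        self.category = category  # hypothetical per-instance argument

process = CrawlerProcess(get_project_settings())
process.crawl(PlunderhereSpider, category='books')  # extra kwargs reach __init__
process.start()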
Example of a settings file for one such spider:
name = plunderhere_com
allowed_domains = plunderhere.com
spiderClass = scraper.spiders.plunderhere_com
start_urls = http://www.plunderhere.com/categories.php?
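The answer doesn't show how these files end up in spiderConfigs; a hypothetical loader for this simple key = value format might look like:

def load_spider_config(path):
    '''Parse a file with one "key = value" pair per line into a config dict.'''
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            key, _, value = line.partition('=')
            config[key.strip()] = value.strip()
    # Scrapy expects lists for these fields, so split on whitespace
    for key in ('start_urls', 'allowed_domains'):
        if key in config:
            config[key] = config[key].split()
    return config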