How to run Scrapy from within a Python script

Tags: python, web-scraping, scrapy, web-crawler

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:

    # This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
    #
    # The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
    #
    # [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.

    #!/usr/bin/python
    import os
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports

    from scrapy import log, signals, project
    from scrapy.xlib.pydispatch import dispatcher
    from scrapy.conf import settings
    from scrapy.crawler import CrawlerProcess
    from multiprocessing import Process, Queue

    class CrawlerScript():

        def __init__(self):
            self.crawler = CrawlerProcess(settings)
            if not hasattr(project, 'crawler'):
                self.crawler.install()
            self.crawler.configure()
            self.items = []
            dispatcher.connect(self._item_passed, signals.item_passed)

        def _item_passed(self, item):
            self.items.append(item)

        def _crawl(self, queue, spider_name):
            spider = self.crawler.spiders.create(spider_name)
            if spider:
                self.crawler.queue.append_spider(spider)
            self.crawler.start()
            self.crawler.stop()
            queue.put(self.items)

        def crawl(self, spider):
            queue = Queue()
            p = Process(target=self._crawl, args=(queue, spider,))
            p.start()
            p.join()
            return queue.get(True)

    # Usage
    if __name__ == "__main__":
        log.start()

        """
        This example runs spider1 and then spider2 three times.
        """
        items = list()
        crawler = CrawlerScript()
        items.append(crawler.crawl('spider1'))
        for i in range(3):
            items.append(crawler.crawl('spider2'))
        print items

    # Snippet imported from snippets.scrapy.org (which no longer works)
    # author: joehillen
    # date  : Oct 24, 2010

Thank you.

648

asked Nov 18 '12 04:11

user47954

1 Answer

The other answers target Scrapy v0.x. According to the updated docs, Scrapy 1.0 and later run a spider from a script with CrawlerProcess:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start() # the script will block here until the crawling is finished

124

answered Sep 20 '22 20:09

danielmhanover
