 

How to run Scrapy from within a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:

```python
# This snippet can be used to run scrapy spiders independent of scrapyd or the
# scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted,
# in which you cannot restart an already running reactor or in this case a
# scrapy instance.
#
# Here is the mailing-list discussion for this snippet:
# http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')  # Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue


class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)


# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times.
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010
```

Thank you.

asked Nov 18 '12 by user47954

People also ask

How do you run a Scrapy project?

Using the scrapy tool:

```
Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl   Run a spider
  fetch   Fetch a URL using the Scrapy downloader
  [...]
```

The first line will print the currently active project if you're inside a Scrapy project.

How do you run a Scrapy in PyCharm?

The scrapy command is a Python script, which means you can start it from inside PyCharm. Create a run/debug configuration in PyCharm with that script as the script, and fill the script parameters with the scrapy command and the spider, in this case `crawl IcecatCrawler`.

Can I run Scrapy on Jupyter notebook?

Jupyter Notebook is very popular among data scientists, alongside options like PyCharm, Zeppelin, VS Code, nteract, Google Colab, and Spyder, to name a few. Scraping with Scrapy is usually done from a .py file, but it can also be initiated from a notebook.

Which is better Scrapy or BeautifulSoup?

Due to its built-in support for generating feed exports in multiple formats, as well as for selecting and extracting data from various sources, Scrapy can be said to perform faster than Beautiful Soup. Working with Beautiful Soup can be sped up with multithreading.
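The multithreading speed-up mentioned above usually targets the network-bound fetch step. A stdlib-only sketch, where a placeholder function stands in for the download-and-parse step that real code would do with requests and BeautifulSoup:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_and_parse(url):
    # Placeholder for the I/O-bound work: a real version would download
    # the page (urllib/requests) and parse it with BeautifulSoup.
    return f"parsed:{url}"


urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
]

# Threads let the network waits overlap; pool.map preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch_and_parse, urls))

print(results)  # one "parsed:" entry per URL, in input order
```

This is the pattern behind the speed-up claim: threads help because scraping time is dominated by waiting on the network, not by parsing.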


1 Answer

All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    # Your spider definition
    ...


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
```
answered Sep 20 '22 by danielmhanover