Building Scrapy spiders into my own program (I don't want to call Scrapy from the command line)

In a similar vein to this question: Stack Overflow: running-multiple-spiders-in-scrapy

I am wondering: can I run an entire Scrapy project from within another Python program? Let's say I wanted to build an entire program that required scraping several different sites, and I built an entire Scrapy project for each site.

Instead of running them from the command line as a one-off, I want to run these spiders and acquire the information from them.

I can use MongoDB in Python fine, and I can already build Scrapy projects that contain spiders; the question now is just merging it all into one application.

I want to run the application once and have the ability to control multiple spiders from my own program.

Why do this? Well, this application may also connect to other sites using an API and needs to compare results from the API site to the scraped site in real time. I never want to have to call Scrapy from the command line; it's all meant to be self-contained.

(I have been asking lots of questions about scraping recently because I am trying to find the right solution to build on.)

Thanks :)

Joseph asked Jun 28 '12 09:06



1 Answer

Yep, of course you can ;)

The idea (inspired by this blog post) is to create a worker and then use it in your own Python script:

# Note: these imports target the old (pre-1.0) Scrapy API; scrapy.conf,
# scrapy.project and scrapy.xlib.pydispatch no longer exist in current releases.
from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing import Queue
import multiprocessing

class CrawlerWorker(multiprocessing.Process):
    """Runs one Scrapy crawl in its own process and collects the scraped items."""

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue

        # Set up a crawler with the project's settings
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # Collect every item the spider emits
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # Executed in the child process: crawl, then hand the items back
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)
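Running each crawl in its own process is deliberate: Scrapy runs on top of Twisted's reactor, which cannot be restarted once it has stopped, so isolating every crawl in a multiprocessing.Process lets a long-lived program start as many crawls as it needs without hitting that limitation.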

Example of use:

result_queue = Queue()
crawler = CrawlerWorker(MySpider(myArgs), result_queue)
crawler.start()                     # runs the crawl in a separate process
for item in result_queue.get():     # blocks until the worker puts its items
    yield item                      # assumes this code lives inside a generator function
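As noted in the code comments, the worker above relies on the old Scrapy API. On current Scrapy releases, the same idea of driving a project's spider from your own script can be sketched roughly like this (MySpider and its import path are placeholders for your own spider):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical import; point this at a spider from your own project.
from myproject.spiders.my_spider import MySpider

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl(MySpider)   # schedule the spider (extra keyword args are passed to it)
process.start()           # start the Twisted reactor; blocks until the crawl finishes

To get the items back into your program as Python objects, you can connect a handler to the item_scraped signal, much like the dispatcher.connect call above, or export them via Scrapy's feed exports and read the file afterwards.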

Another way would be to execute the scrapy crawl command from your program, e.g. with os.system() or the subprocess module.
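A minimal sketch of that approach, assuming the script is run from the Scrapy project directory and the project defines a spider named "myspider" (both are placeholders):

import subprocess

# Run the spider as a child process and wait for it; -o exports the scraped items to a file
subprocess.run(["scrapy", "crawl", "myspider", "-o", "items.json"], check=True)

The downside is that you only get the results back through a file or a database, not as live Python objects inside your program.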

Maxime Lorant answered Dec 08 '22 06:12