In a similar vein to this question: stackoverflow: running-multiple-spiders-in-scrapy
I am wondering: can I run an entire Scrapy project from within another Python program? Let's say I wanted to build a program that requires scraping several different sites, and I build an entire Scrapy project for each site.
Instead of running them from the command line as one-offs, I want to run these spiders and acquire the information from them.
I can already use MongoDB from Python, and I can already build Scrapy projects that contain spiders, but now I need to merge it all into one application.
I want to run the application once and have the ability to control multiple spiders from my own program.
Why do this? Well, this application may also connect to other sites through an API and needs to compare the API results to the scraped results in real time. I don't want to ever have to call Scrapy from the command line; it's all meant to be self-contained.
(I have been asking lots of questions about scraping recently, because I am trying to find the right solution to build on.)
Thanks :)
The key to running Scrapy from a Python script is the CrawlerProcess class, found in the scrapy.crawler module. It provides the engine that runs Scrapy inside your own script; under the hood it is built on the Twisted framework.
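On recent Scrapy versions (1.x and later) the bare minimum looks roughly like this; the spider defined here is just a placeholder so the sketch is self-contained:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Placeholder spider; swap in one of your own project spiders
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

process = CrawlerProcess()
process.crawl(MySpider)   # call crawl() once per spider you want to run
process.start()           # blocks until every scheduled crawl has finished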
Yep, of course you can ;)
The idea (inspired by this blog post, and written against the older, pre-1.0 Scrapy API) is to create a worker and then use it in your own Python script:
from scrapy import project, signals
from scrapy.conf import settings                 # pre-1.0 Scrapy settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing import Queue
import multiprocessing

class CrawlerWorker(multiprocessing.Process):
    """Runs one spider in its own process and collects its scraped items."""

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue

        # Set up the Scrapy engine (old install/configure API)
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # Collect every scraped item via the item_passed signal
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # Runs in the child process: crawl, then hand the items back
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)
Example of use:
result_queue = Queue()
crawler = CrawlerWorker(MySpider(myArgs), result_queue)
crawler.start()
for item in result_queue.get():
    yield item
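Since each CrawlerWorker is its own process, the same pattern extends to several spiders at once; a rough sketch (the spider classes here are placeholders for your own):

# One worker per spider/site, results merged into a single list
queues, workers = [], []
for spider in (SiteOneSpider(), SiteTwoSpider()):   # placeholder spiders
    q = Queue()
    w = CrawlerWorker(spider, q)
    w.start()
    queues.append(q)
    workers.append(w)

all_items = []
for q, w in zip(queues, workers):
    all_items.extend(q.get())   # blocks until that worker has finished crawling
    w.join()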
Another way would be to execute the scrapy crawl command with os.system() or the subprocess module.
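A minimal sketch of that route, assuming the project directory and spider name are placeholders; subprocess is preferable to os.system() because it lets you check the exit code:

import subprocess

# Run the spider exactly as you would from the shell;
# cwd must point at the Scrapy project directory (placeholder path)
subprocess.check_call(
    ["scrapy", "crawl", "myspider", "-o", "items.json"],
    cwd="/path/to/my_scrapy_project",
)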