
Django custom management command running Scrapy: How to include Scrapy's options?

I want to be able to run the Scrapy web crawling framework from within Django. Scrapy itself only provides a command line tool, scrapy, to execute its commands; that is, the tool was not designed to be called from an external program.

The user Mikhail Korobov came up with a nice solution, namely to call Scrapy from a Django custom management command. For convenience, I repeat his solution here:

# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py 

from __future__ import absolute_import
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def run_from_argv(self, argv):
        self._argv = argv
        return super(Command, self).run_from_argv(argv)

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])

Instead of calling e.g. scrapy crawl domain.com, I can now run python manage.py scrapy crawl domain.com from within a Django project. However, Scrapy's own command line options are not passed through: manage.py tries to parse them itself and rejects them. If I run python manage.py scrapy crawl domain.com -o scraped_data.json -t json, I only get the following response:

Usage: manage.py scrapy [options] 

manage.py: error: no such option: -o

So my question is, how to extend the custom management command to adopt Scrapy's command line options?

Unfortunately, Django's documentation of this part is not very extensive. I've also read the documentation of Python's optparse module, but it did not make things any clearer to me. Can anyone help me with this? Thanks a lot in advance!
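The error above can be reproduced with optparse alone, without Django or Scrapy. This is only a minimal sketch: it assumes the management command parses its arguments with a plain optparse parser that has no -o option registered, and the prog name is just there to mimic the message.

```python
from optparse import OptionParser

# Stand-in for the parser behind "manage.py scrapy" (an assumption for
# illustration): it has no -o option registered.
parser = OptionParser(prog="manage.py scrapy")

try:
    parser.parse_args(["crawl", "domain.com", "-o", "scraped_data.json"])
    exit_code = None
except SystemExit as exc:
    # optparse prints the usage and "no such option: -o" to stderr,
    # then exits the process with status 2.
    exit_code = exc.code

print(exit_code)
```

The parser stops at the first option it does not know, which is why nothing ever reaches Scrapy.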

pemistahl asked May 12 '12 13:05


2 Answers

Okay, I have found a solution to my problem. It's a bit ugly but it works. Since the Django project's manage.py command does not accept Scrapy's command line options, I split each option into two arguments (the dash and the option name), both of which manage.py accepts as plain positional arguments. After successful parsing, I rejoin the two arguments and pass them to Scrapy.

That is, instead of writing

python manage.py scrapy crawl domain.com -o scraped_data.json -t json

I put a space between the dash and the option name, like this:

python manage.py scrapy crawl domain.com - o scraped_data.json - t json

My handle function looks like this:

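def handle(self, *args, **options):
    arguments = self._argv[1:]
    # Rejoin each lone '-' or '--' with the token that follows it.
    # A while loop is used because the list shrinks as pairs are merged;
    # a for loop over a list that is being modified would be fragile.
    i = 0
    while i < len(arguments) - 1:
        if arguments[i] in ('-', '--'):
            arguments[i:i + 2] = [arguments[i] + arguments[i + 1]]
        i += 1

    from scrapy.cmdline import execute
    execute(arguments)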
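The rejoining step can be tried on its own, without Django or Scrapy installed. The sketch below uses a hypothetical helper name, rejoin_split_options, which is not part of either framework; it only illustrates what happens to the argument list before it is handed to Scrapy.

```python
def rejoin_split_options(argv):
    """Glue each lone '-' or '--' onto the token that follows it."""
    rejoined = []
    tokens = iter(argv)
    for token in tokens:
        if token in ("-", "--"):
            # Consume the next token as well; a trailing dash stays as it is.
            token += next(tokens, "")
        rejoined.append(token)
    return rejoined


result = rejoin_split_options(
    ["scrapy", "crawl", "domain.com",
     "-", "o", "scraped_data.json", "-", "t", "json"]
)
print(result)
# ['scrapy', 'crawl', 'domain.com', '-o', 'scraped_data.json', '-t', 'json']
```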

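Meanwhile, Mikhail Korobov has provided the optimal solution. Here run_from_argv stores the raw arguments and calls self.execute() directly instead of the parent implementation, so Django never tries to parse Scrapy's options and they reach Scrapy unchanged: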

# -*- coding: utf-8 -*- 
# myapp/management/commands/scrapy.py 

from __future__ import absolute_import
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])

pemistahl answered Oct 02 '22 22:10


I think you're really looking for Guideline 10 of the POSIX argument syntax conventions:

The argument -- should be accepted as a delimiter indicating the end of options. Any following arguments should be treated as operands, even if they begin with the '-' character. The -- argument should not be used as an option or as an operand.

Python's optparse module behaves this way, even on Windows.
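A quick way to see this behaviour is to feed such an argument list to a bare optparse parser. This is only a sketch: the parser below is a stand-in, not Django's actual parser, and the --settings option is registered purely for illustration.

```python
from optparse import OptionParser

# Stand-in parser that knows nothing about -o or -t.
parser = OptionParser()
parser.add_option("--settings")  # hypothetical known option

argv = ["crawl", "domain.com", "--", "-o", "scraped_data.json", "-t", "json"]
options, args = parser.parse_args(argv)

# optparse consumes the "--" itself and returns everything after it
# untouched as positional arguments.
print(args)
# ['crawl', 'domain.com', '-o', 'scraped_data.json', '-t', 'json']
```

The positional arguments therefore still contain -o and -t, ready to be handed on to another program.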

I put the Scrapy project settings module first in the argument list, so I can create separate Scrapy projects in independent apps:

# <app>/management/commands/scrapy.py
from __future__ import absolute_import
import os

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def handle(self, *args, **options):
        os.environ['SCRAPY_SETTINGS_MODULE'] = args[0]
        from scrapy.cmdline import execute
        # scrapy ignores args[0], requires a mutable seq
        execute(list(args))

Invoked as follows:

python manage.py scrapy myapp.scrapyproj.settings crawl domain.com -- -o scraped_data.json -t json

Tested with Scrapy 0.12 and Django 1.3.1.

Aryeh Leib Taurog answered Oct 02 '22 22:10