I want to be able to run the Scrapy web crawling framework from within Django. Scrapy itself only provides a command line tool, scrapy, to execute its commands, i.e. the tool was not designed to be called from an external program.
The user Mikhail Korobov came up with a nice solution, namely to call Scrapy from a Django custom management command. For convenience, I repeat his solution here:
# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py
from __future__ import absolute_import
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def run_from_argv(self, argv):
        self._argv = argv
        return super(Command, self).run_from_argv(argv)

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])
Instead of calling e.g.
scrapy crawl domain.com
I can now do
python manage.py scrapy crawl domain.com
from within a Django project. However, the options of a Scrapy command are not parsed at all. If I do
python manage.py scrapy crawl domain.com -o scraped_data.json -t json
I only get the following response:
Usage: manage.py scrapy [options]
manage.py: error: no such option: -o
So my question is: how can I extend the custom management command to accept Scrapy's command line options?
Unfortunately, Django's documentation of this part is not very extensive. I've also read the documentation of Python's optparse module, but it did not make things any clearer to me. Can anyone help me in this respect? Thanks a lot in advance!
Okay, I have found a solution to my problem. It's a bit ugly but it works. Since the Django project's manage.py command does not accept Scrapy's command line options, I split each option string into two arguments, both of which manage.py accepts. After successful parsing, I rejoin the two arguments and pass them to Scrapy.
That is, instead of writing
python manage.py scrapy crawl domain.com -o scraped_data.json -t json
I put a space between the dash and the option name, like this:
python manage.py scrapy crawl domain.com - o scraped_data.json - t json
My handle function looks like this:
def handle(self, *args, **options):
    arguments = self._argv[1:]
    for arg in arguments:
        # A lone '-' or '--' is one half of a split option.
        if arg in ('-', '--'):
            i = arguments.index(arg)
            # Glue the dash back onto the option name, e.g. '-' + 'o' -> '-o'.
            new_arg = ''.join((arguments[i], arguments[i+1]))
            del arguments[i:i+2]
            arguments.insert(i, new_arg)

    from scrapy.cmdline import execute
    execute(arguments)
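The loop above changes the list while iterating over it, which happens to work for this input but is easy to break. As a minimal sketch of the same rejoining idea, the helper below builds a new list instead. The function name rejoin_split_options is made up for illustration and is not part of Django or Scrapy; unlike the original loop, it keeps a trailing lone dash instead of raising an IndexError.

```python
def rejoin_split_options(arguments):
    """Merge a lone '-' or '--' with the argument that follows it.

    Hypothetical helper for illustration; not part of Django or Scrapy.
    """
    rejoined = []
    pending_dash = None
    for arg in arguments:
        if pending_dash is not None:
            # The previous argument was a lone dash: glue it to this one.
            rejoined.append(pending_dash + arg)
            pending_dash = None
        elif arg in ('-', '--'):
            # Remember the dash and wait for the option name.
            pending_dash = arg
        else:
            rejoined.append(arg)
    if pending_dash is not None:
        # A trailing dash has nothing to join with, so keep it unchanged.
        rejoined.append(pending_dash)
    return rejoined


split_args = ['scrapy', 'crawl', 'domain.com',
              '-', 'o', 'scraped_data.json', '-', 't', 'json']
print(rejoin_split_options(split_args))
# ['scrapy', 'crawl', 'domain.com', '-o', 'scraped_data.json', '-t', 'json']
```

Inside handle, the result could then be passed to scrapy.cmdline.execute in place of the list that was modified in the loop.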
Meanwhile, Mikhail Korobov has provided the optimal solution. See here:
# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py
from __future__ import absolute_import
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])
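Why this works, as far as I can tell: the overridden run_from_argv no longer calls the parent implementation, so Django's own option parser never sees -o or -t. The sketch below imitates that control flow without importing Django. MiniBaseCommand is a hypothetical stand-in that I assume behaves roughly like the optparse-based BaseCommand of that Django era; it is not the real class, and handle returns the arguments instead of calling Scrapy.

```python
from optparse import OptionParser


class MiniBaseCommand(object):
    """Hypothetical stand-in for Django's BaseCommand (assumed behaviour)."""

    def run_from_argv(self, argv):
        # Parse everything after "manage.py <subcommand>" with optparse.
        # An unknown option such as -o makes the parser exit with an error.
        parser = OptionParser(prog=argv[0])
        options, args = parser.parse_args(argv[2:])
        return self.execute(*args, **options.__dict__)

    def execute(self, *args, **options):
        return self.handle(*args, **options)

    def handle(self, *args, **options):
        raise NotImplementedError


class PassThroughCommand(MiniBaseCommand):
    """Mirrors the command above: skip parsing, keep the raw argv."""

    def run_from_argv(self, argv):
        self._argv = argv
        return self.execute()

    def handle(self, *args, **options):
        # The real command would call scrapy.cmdline.execute(self._argv[1:]).
        return self._argv[1:]


argv = ['manage.py', 'scrapy', 'crawl', 'domain.com',
        '-o', 'scraped_data.json', '-t', 'json']
print(PassThroughCommand().run_from_argv(argv))
# ['scrapy', 'crawl', 'domain.com', '-o', 'scraped_data.json', '-t', 'json']
```

The trade-off is that the options normally handled by the management command's own parser are skipped as well, since the whole argument list goes straight to Scrapy.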
I think you're really looking for Guideline 10 of the POSIX argument syntax conventions:
The argument -- should be accepted as a delimiter indicating the end of options. Any following arguments should be treated as operands, even if they begin with the '-' character. The -- argument should not be used as an option or as an operand.
Python's optparse module behaves this way, even under Windows.
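To illustrate the delimiter behaviour, here is a small standalone sketch using only the standard library's optparse module. The --verbosity option and the parse helper are examples chosen for illustration and are not taken from Django's actual parser setup.

```python
from optparse import OptionParser


def parse(argv):
    """Parse argv with a tiny example parser (illustrative only)."""
    parser = OptionParser()
    parser.add_option('-v', '--verbosity', dest='verbosity', default='1')
    return parser.parse_args(argv)


argv = ['--verbosity', '2', 'crawl', 'domain.com',
        '--', '-o', 'scraped_data.json', '-t', 'json']
options, args = parse(argv)

print(options.verbosity)
# 2
print(args)
# ['crawl', 'domain.com', '-o', 'scraped_data.json', '-t', 'json']
```

Everything after -- comes back as a plain positional argument, and the -- itself is dropped, so -o and -t never reach the parser as options.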
I put the Scrapy project settings module in the argument list, so I can create separate Scrapy projects in independent apps:
# <app>/management/commands/scrapy.py
from __future__ import absolute_import
import os
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def handle(self, *args, **options):
        os.environ['SCRAPY_SETTINGS_MODULE'] = args[0]
        from scrapy.cmdline import execute
        # scrapy ignores args[0], requires a mutable seq
        execute(list(args))
Invoked as follows:
python manage.py scrapy myapp.scrapyproj.settings crawl domain.com -- -o scraped_data.json -t json
Tested with Scrapy 0.12 and Django 1.3.1.