I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes -- this will instead be configurable in a GUI.
How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow.
WARNING: This answer was written for Scrapy v0.7; the spider manager API has changed a lot since then.
Override the default SpiderManager class, load your custom rules from a database (or somewhere else) and instantiate a custom spider with your own rules/regexes and domain_name.
in mybot/settings.py:
SPIDER_MANAGER_CLASS = 'mybot.spidermanager.MySpiderManager'
in mybot/spidermanager.py:
from mybot.spider import MyParametrizedSpider

class MySpiderManager(object):
    loaded = True

    def fromdomain(self, name):
        start_urls, extra_domain_names, regexes = self._get_spider_info(name)
        return MyParametrizedSpider(name, start_urls, extra_domain_names, regexes)

    def close_spider(self, spider):
        # Put here any code you want to run before the spider is closed
        pass

    def _get_spider_info(self, name):
        # Query your backend (maybe an SQL db) using `name` as the primary key,
        # and return start_urls, extra_domains and regexes
        ...
        return (start_urls, extra_domains, regexes)
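If you'd rather read the configuration from a file written by the GUI (as the question suggests) instead of a database, the lookup can be a simple JSON read that _get_spider_info delegates to. A hypothetical sketch -- the config directory and JSON key names are made up:

import json
import os

def load_spider_info(name, config_dir='/etc/mybot/spiders'):
    # Hypothetical layout: the GUI writes one JSON file per spider, e.g.
    # {"start_urls": [...], "extra_domains": [...], "regexes": [...]}
    path = os.path.join(config_dir, name + '.json')
    with open(path) as f:
        cfg = json.load(f)
    return (cfg['start_urls'], cfg['extra_domains'], cfg['regexes'])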
and now your custom spider class, in mybot/spider.py:
from scrapy.spider import BaseSpider

class MyParametrizedSpider(BaseSpider):

    def __init__(self, name, start_urls, extra_domain_names, regexes):
        self.domain_name = name
        self.start_urls = start_urls
        self.extra_domain_names = extra_domain_names
        self.regexes = regexes

    def parse(self, response):
        ...
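The parse method is where the user-configured regexes actually get applied. A rough sketch of what it might look like inside MyParametrizedSpider -- the regex-on-raw-HTML link extraction is only a placeholder (a real spider would use one of Scrapy's link extractors), and the Request import path and response.body attribute are assumptions about your Scrapy version:

# at the top of mybot/spider.py
import re
from scrapy.http import Request

# inside MyParametrizedSpider
def parse(self, response):
    # Crude link extraction: pull href attributes out of the raw HTML
    for url in re.findall(r'href="([^"]+)"', response.body):
        # Only follow links matching one of the user-configured regexes
        if any(re.search(rx, url) for rx in self.regexes):
            yield Request(url, callback=self.parse)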
Notes:
Run the spider with ./scrapy-ctl.py crawl <name>, where name is passed to SpiderManager.fromdomain and is the key used to retrieve more spider info from the backend system.

What you need is to dynamically create spider classes, subclassing your favorite generic spider class as supplied by Scrapy (CrawlSpider subclasses with your rules added, or XmlFeedSpider, or whatever) and adding domain_name, start_urls, and possibly extra_domain_names (and/or start_requests(), etc.), as you get or deduce them from your GUI (or config file, or whatever).
Python makes it easy to perform such dynamic creation of class objects; a very simple example might be:
from scrapy import spider

def makespider(domain_name, start_urls,
               basecls=spider.BaseSpider):
    return type(domain_name + 'Spider',
                (basecls,),
                {'domain_name': domain_name,
                 'start_urls': start_urls})

allspiders = []
for domain, urls in listofdomainurlpairs:
    allspiders.append(makespider(domain, urls))
This gives you a list of very bare-bones spider classes -- you'll probably want to add parse
methods to them before you instantiate them. Season to taste...;-).
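For instance, you can pass a parse callable into the class dictionary the same way the other attributes are passed. A sketch using the same (old) scrapy.spider import as above; generic_parse is a hypothetical placeholder you would replace with your own logic:

from scrapy import spider

def generic_parse(self, response):
    # Hypothetical placeholder: whatever per-page scraping logic you need
    return []

def makespider(domain_name, start_urls, parse_func=generic_parse,
               basecls=spider.BaseSpider):
    # The callable becomes the generated class's parse method
    return type(domain_name + 'Spider',
                (basecls,),
                {'domain_name': domain_name,
                 'start_urls': start_urls,
                 'parse': parse_func})

spidercls = makespider('example.com', ['http://example.com/'])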