How to dynamically set Scrapy rules?

Tags: python, scrapy

I have a spider class that defines its rules at class level, before __init__ runs:

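# Import paths as used by the older Scrapy releases that ship SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class NoFollowSpider(CrawlSpider):
    # Follow every link and pass each page to parse_items
    rules = (
        Rule(SgmlLinkExtractor(allow=("", ),), callback="parse_items", follow=True),
    )

    def __init__(self, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams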

I run this spider with the following command:

> scrapy runspider my_spider.py -a moreparams="more parameters" -o output.txt 

Now, I want the static variable named rules to be configurable from the command-line:

> scrapy runspider my_spider.py -a crawl_pages=True -a moreparams="more parameters" -o output.txt

changing the init to:

def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
    if (crawl_pages is True):
        self.rules = (
            Rule(SgmlLinkExtractor(allow=("", ),), callback="parse_items", follow=True),
        )
    self.moreparams = moreparams
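
One detail worth checking in a signature like this: values passed with -a reach __init__ as strings, so a check such as `crawl_pages is True` stays false even when the command line says True. Below is a minimal sketch of one way to interpret the value. The helper name `parse_bool_arg` and the list of accepted spellings are assumptions made for illustration, not part of Scrapy.

```python
def parse_bool_arg(value, default=False):
    """Hypothetical helper: interpret a spider argument as a boolean."""
    if value is None:
        return default
    if isinstance(value, bool):
        # Already a real boolean, e.g. when the spider is built from Python code
        return value
    # Command-line values are strings; accept a few common "true" spellings
    return str(value).strip().lower() in ("1", "true", "yes", "on")


raw = "True"  # what `-a crawl_pages=True` delivers to __init__
print(raw is True)           # False: a string is never the True singleton
print(parse_bool_arg(raw))   # True
print(parse_bool_arg("no"))  # False
```

Inside the spider, the condition could then read `if parse_bool_arg(crawl_pages):` instead of comparing against True directly.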

However, if I set rules inside __init__, Scrapy no longer takes it into account: the spider runs, but it only crawls the given start_urls and not the whole domain. It seems that rules must be a static class variable.

So, how can I dynamically set a static variable?

Antoine Brunel asked Dec 08 '22 05:12


1 Answer

Here is how I resolved the problem, with great help from @Not_a_Golfer and @nramirezuy. I simply combined what each of them suggested:

class NoFollowSpider(CrawlSpider):

    def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        # Values passed with -a arrive as strings, so accept "True" as well
        if crawl_pages in (True, "True"):
            # Set the class member from here
            NoFollowSpider.rules = (
                Rule(SgmlLinkExtractor(allow=("", ),), callback="parse_items", follow=True),
            )
            # Then recompile the rules so the new ones are picked up
            self._compile_rules()

        # Keep going as before
        self.moreparams = moreparams
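
Why the `_compile_rules()` call matters can be sketched with a toy model. As far as I can tell, CrawlSpider takes a snapshot of `rules` into a private list while the base `__init__` runs, so anything assigned afterwards is not seen until the rules are compiled again. The classes below are stand-ins written for illustration only: `ToyCrawlSpider` and `LateRulesSpider` are hypothetical names and contain no Scrapy code.

```python
import copy


class ToyCrawlSpider:
    """Toy stand-in for CrawlSpider; it only mimics the snapshot idea."""

    rules = ()

    def __init__(self):
        # The base class snapshots `rules` once, at construction time
        self._compile_rules()

    def _compile_rules(self):
        # Later lookups use this private copy, not `rules` itself
        self._rules = [copy.copy(rule) for rule in self.rules]


class LateRulesSpider(ToyCrawlSpider):
    def __init__(self, recompile):
        super().__init__()
        # Assigned after the snapshot was taken
        self.rules = ("follow-everything",)
        if recompile:
            self._compile_rules()


stale = LateRulesSpider(recompile=False)
fresh = LateRulesSpider(recompile=True)
print(stale._rules)  # []
print(fresh._rules)  # ['follow-everything']
```

Under that assumption, an alternative would be to assign `rules` before calling the base `__init__`, so that the first snapshot already contains them.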

Thank you all for your help!

Antoine Brunel answered Dec 21 '22 02:12