
Scrapy: Rules set inside __init__ are ignored by CrawlSpider

I've been stuck on this for a few days, and it's driving me crazy.

I call my Scrapy spider like this:

scrapy crawl example -a follow_links="True"

I pass in the "follow_links" flag to determine whether the entire website should be scraped, or just the index page I have defined in the spider.

This flag is checked in the spider's constructor to see which rule should be set:

def __init__(self, *args, **kwargs):

    super(ExampleSpider, self).__init__(*args, **kwargs)

    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )

If it's "True", all links are allowed; if it's "False", all links are denied.

So far, so good; however, these rules are ignored. The only way I can get rules to be followed is if I define them outside of the constructor. That means something like this would work correctly:

class ExampleSpider(CrawlSpider):

    rules = (
        Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
    )

    def __init__(self, *args, **kwargs):
        ...

So basically, defining the rules within the __init__ constructor causes the rules to be ignored, whereas defining the rules outside of the constructor works as expected.

I cannot understand this. My code is below.

import re
import scrapy

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags, remove_comments, replace_escape_chars, replace_entities, remove_tags_with_content


class ExampleSpider(CrawlSpider):

    name = "example"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']    
    # if the rule below is uncommented, it works as expected (i.e. follow links and call parse_pages)
    # rules = (
    #     Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
    # )

    def __init__(self, *args, **kwargs):

        super(ExampleSpider, self).__init__(*args, **kwargs)

        # single page or follow links
        self.follow_links = kwargs.get('follow_links')
        if self.follow_links == "True":
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )


    def parse_pages(self, response):
        print("In parse_pages")
        print(response.xpath('/html/body').extract())
        return None


    def parse_start_url(self, response):
        print("In parse_start_url")
        print(response.xpath('/html/body').extract())
        return None

Thank you for taking the time to help me on this matter.

asked Jan 05 '23 by Tom Brock


1 Answer

The problem here is that the CrawlSpider constructor (__init__) also processes the rules attribute, so if you need to assign them dynamically, you have to do it before calling the default constructor.
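
For context, this is roughly what happens inside Scrapy (a simplified paraphrase, not the exact source; details vary between versions): CrawlSpider.__init__ compiles self.rules into an internal list, and link extraction only ever consults that compiled copy afterwards.

# simplified paraphrase of CrawlSpider internals (not the exact Scrapy source)
class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()   # snapshots self.rules into self._rules

    def _requests_to_follow(self, response):
        # link extraction walks the compiled self._rules, so anything
        # assigned to self.rules after __init__ has run is never seen
        ...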

In other words, do everything you need before calling super(ExampleSpider, self).__init__(*args, **kwargs):

def __init__(self, *args, **kwargs):
    # set self.rules first...
    self.rules = (...)
    # ...and only then call the parent constructor, which compiles them
    super(ExampleSpider, self).__init__(*args, **kwargs)
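
Applied to the spider from the question, a minimal sketch might look like this (it keeps the original follow_links string argument and parse_pages callback; treat it as a sketch rather than a drop-in replacement):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):

    name = "example"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, *args, **kwargs):
        # decide on the rules first...
        if kwargs.get('follow_links') == "True":
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )
        # ...and only then let CrawlSpider.__init__ compile them
        super(ExampleSpider, self).__init__(*args, **kwargs)

    def parse_pages(self, response):
        print("In parse_pages")
        print(response.xpath('/html/body').extract())

With that change, scrapy crawl example -a follow_links="True" should pick up the follow-everything rule, while omitting the flag (or passing anything else) falls back to the deny-everything rule.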
answered Jan 08 '23 by eLRuLL