I've been stuck on this for a few days, and it's driving me crazy.
I call my scrapy spider like this:
scrapy crawl example -a follow_links="True"
I pass in the "follow_links" flag to determine whether the entire website should be scraped, or just the index page I have defined in the spider.
This flag is checked in the spider's constructor to see which rule should be set:
def __init__(self, *args, **kwargs):
    super(ExampleSpider, self).__init__(*args, **kwargs)
    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )
If it's "True", all links are allowed; if it's "False", all links are denied.
So far, so good; however, these rules are ignored. The only way I can get rules to be followed is if I define them outside of the constructor. That means something like this would work correctly:
class ExampleSpider(CrawlSpider):

    rules = (
        Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
    )

    def __init__(self, *args, **kwargs):
        ...
So basically, defining the rules within the __init__ constructor causes the rules to be ignored, whereas defining the rules outside of the constructor works as expected.
I cannot understand this. My code is below.
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags, remove_comments, replace_escape_chars, replace_entities, remove_tags_with_content


class ExampleSpider(CrawlSpider):

    name = "example"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # if the rule below is uncommented, it works as expected (i.e. follow links and call parse_pages)
    # rules = (
    #     Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
    # )

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)

        # single page or follow links
        self.follow_links = kwargs.get('follow_links')

        if self.follow_links == "True":
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )

    def parse_pages(self, response):
        print("In parse_pages")
        print(response.xpath('/html/body').extract())
        return None

    def parse_start_url(self, response):
        print("In parse_start_url")
        print(response.xpath('/html/body').extract())
        return None
Thank you for taking the time to help me on this matter.
The problem here is that the CrawlSpider constructor (__init__) also processes the rules attribute, so if you need to assign your own rules, you'll have to do it before calling the default constructor.
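To see why, here is a simplified sketch of what CrawlSpider's constructor does (this is not the exact Scrapy source, just the relevant behaviour): __init__ calls _compile_rules(), which copies self.rules into the internal _rules list that the link-following machinery actually uses, so anything assigned to self.rules after that point is never compiled.

import copy
from scrapy.spiders import Spider

class CrawlSpider(Spider):
    # class-level default; a subclass overrides this either at class level
    # or by assigning self.rules *before* this __init__ runs
    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()   # snapshots self.rules right here

    def _compile_rules(self):
        # copies self.rules into self._rules, which is what the spider reads
        # later when extracting links (the real version also resolves the
        # callback names into bound methods)
        self._rules = [copy.copy(rule) for rule in self.rules]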
In other words, do everything you need before calling super(ExampleSpider, self).__init__(*args, **kwargs):
def __init__(self, *args, **kwargs):
    # set my own rules here, before the call below,
    # because the parent constructor is what compiles them
    super(ExampleSpider, self).__init__(*args, **kwargs)
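Applied to the spider from the question (same rules and names, just assigned before the super() call), a minimal sketch of the corrected constructor looks like this:

def __init__(self, *args, **kwargs):
    # decide which rules to use first...
    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )
    # ...and only then call the parent constructor, which compiles self.rules
    super(ExampleSpider, self).__init__(*args, **kwargs)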