I want to parse a sitemap, extract all of its URLs, append a word to each URL, and then check the response code of every modified URL.
For this task I decided to use Scrapy because it can crawl sitemaps out of the box, as described in Scrapy's documentation.
With the help of that documentation I created my spider, but I want to change the URLs before they are sent for fetching. For this I tried to follow this link, which suggested using rules and implementing process_request(), but I have not been able to make these work. My attempt is in the commented-out lines below. Could anyone help me write the exact code for the commented lines, or suggest another way to do this in Scrapy?
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    #sitemap_rules = [some_rules, process_request='process_request')]
    #def process_request(self, request, spider):
    #   modified_url=orginal_url_from_sitemap + 'myword'
    #   return request.replace(url = modified_url)        
    def parse(self, response):
        print(response.status, response.url)
You can connect the request_scheduled signal to a handler function and modify the request inside that function. For example:
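from scrapy import signals
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Call request_scheduled() each time the engine schedules a request.
        crawler.signals.connect(spider.request_scheduled, signals.request_scheduled)
        return spider

    def request_scheduled(self, request, spider):
        # Note: Request.url may be read-only depending on your Scrapy version
        # (Request.replace() returns a copy instead); in that case this
        # assignment raises AttributeError.
        modified_url = request.url + 'myword'
        request.url = modified_url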
SitemapSpider has a sitemap_filter method.
You can override it to implement the required functionality.
class MySpider(SitemapSpider):
    ...
    def sitemap_filter(self, entries):
        for entry in entries:
            # Append the word to each URL before it is requested.
            entry["loc"] = entry["loc"] + "myword"
            yield entry
Each entry is a dict with a structure like this:
<class 'dict'>:
 {'loc': 'https://example.com/',
  'lastmod': '2019-01-04T08:09:23+00:00',
  'changefreq': 'weekly',
  'priority': '0.8'}
Important note: the SitemapSpider.sitemap_filter method was added in Scrapy 1.6.0, released in January 2019 (see the 1.6.0 release notes, "New extensibility features" section).
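Because the entries are plain dicts, the rewriting logic can be tried without running a crawl. Here is a minimal sketch of that idea; the helper name append_word and the sample entries are made up for illustration and are not part of Scrapy.

```python
def append_word(entries, word):
    """Yield sitemap entries with `word` appended to each 'loc' URL.

    Hypothetical helper for illustration; it is not a Scrapy API.
    """
    for entry in entries:
        # Copy the dict so the caller's entries are left untouched.
        modified = dict(entry)
        modified["loc"] = entry["loc"] + word
        yield modified


# Sample entries shaped like the dict shown above (made-up data).
sample_entries = [
    {
        "loc": "https://example.com/",
        "lastmod": "2019-01-04T08:09:23+00:00",
        "changefreq": "weekly",
        "priority": "0.8",
    },
    {"loc": "https://example.com/about/"},
]

for entry in append_word(sample_entries, "myword"):
    print(entry["loc"])
```

Inside the spider, sitemap_filter could then delegate to it with `yield from append_word(entries, "myword")`, assuming a plain string concatenation is really what you want for URLs that carry a query string or fragment.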