 

Modifying URLs before fetching in Scrapy

I want to parse a sitemap, extract all URLs from it, append some word to each URL, and then check the response code of each modified URL.

For this task I decided to use Scrapy, because it has built-in support for crawling sitemaps, as described in Scrapy's documentation.

With the help of that documentation I created my spider, but I want to change the URLs before they are fetched. I tried to get help from this link, which suggested using rules and implementing process_request(), but I was not able to make use of them. The little I tried is in the commented lines below. Could anyone help me write the exact code for the commented lines, or suggest any other way to do this task in Scrapy?

from scrapy.contrib.spiders import SitemapSpider
class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # sitemap_rules = [(some_rule, process_request='process_request')]

    # def process_request(self, request, spider):
    #     modified_url = original_url_from_sitemap + 'myword'
    #     return request.replace(url=modified_url)

    def parse(self, response):
        print response.status, response.url  
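
The URL-rewriting step itself is independent of Scrapy. A minimal standalone sketch of extracting sitemap URLs and appending a word (the inline sitemap content and the modified_urls helper are illustrative, not part of the question's code):

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content, so no network access is needed.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/page1</loc></url>
  <url><loc>http://www.example.com/page2</loc></url>
</urlset>"""

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def modified_urls(sitemap_xml, word='myword'):
    """Extract each <loc> URL from the sitemap and append `word`."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text + word for loc in root.findall('.//sm:loc', NS)]

print(modified_urls(SITEMAP))
# ['http://www.example.com/page1myword', 'http://www.example.com/page2myword']
```

Scrapy is still the right tool for the fetching-and-status-checking part; this only shows the rewrite in isolation.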
Alok asked Oct 31 '22

2 Answers

You can connect the request_scheduled signal to a spider method and modify the request there. For example:

from scrapy import signals
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.request_scheduled, signals.request_scheduled)
        return spider

    def request_scheduled(self, request, spider):
        # Request.url has no public setter, so plain assignment raises
        # AttributeError; _set_url is internal API but changes it in place.
        request._set_url(request.url + 'myword')
zczhuohuo answered Nov 09 '22

SitemapSpider has a sitemap_filter method.
You can override it to implement the required functionality.

class MySpider(SitemapSpider):

    ...

    def sitemap_filter(self, entries):
        for entry in entries:
            entry["loc"] = entry["loc"] + "myword"
            yield entry

Each entry object is a dict with a structure like this:

<class 'dict'>:
 {'loc': 'https://example.com/',
  'lastmod': '2019-01-04T08:09:23+00:00',
  'changefreq': 'weekly',
  'priority': '0.8'}
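
Since the filter receives plain dicts, the rewrite can be checked without Scrapy at all. A minimal standalone sketch of the same logic (the append_word name and the sample entry are illustrative; the dict shape matches the one shown above):

```python
def append_word(entries, word='myword'):
    # Same rewrite as the sitemap_filter override above, as a plain generator.
    for entry in entries:
        entry = dict(entry)                  # don't mutate the caller's dicts
        entry["loc"] = entry["loc"] + word
        yield entry

entries = [{'loc': 'https://example.com/',
            'lastmod': '2019-01-04T08:09:23+00:00',
            'changefreq': 'weekly',
            'priority': '0.8'}]

print(next(append_word(entries))['loc'])  # https://example.com/myword
```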

Important note: the SitemapSpider.sitemap_filter method appeared in Scrapy 1.6.0, released in January 2019 (see the "new extensibility features" section of the 1.6.0 release notes).

Georgiy answered Nov 09 '22