I want to parse sitemap and find out all urls from sitemap and then appending some word to all urls and then I want to check response code of all modified urls.
for this task I decided to use scrapy because it have luxury to crawl sitemaps. its given in scarpy's documentation
with the help of this documentation I created my spider. but I want to change urls before sending for fetching. so for this I tried to take help from this link. this link suggested my to use rules
and implement process_requests()
. but I am not able to make use of these. I tired little bit that I have commented. could anyone help me write exact code for commented lines or any other ways to do this task in scrapy?
from scrapy.contrib.spiders import SitemapSpider
class MySpider(SitemapSpider):
sitemap_urls = ['http://www.example.com/sitemap.xml']
#sitemap_rules = [some_rules, process_request='process_request')]
#def process_request(self, request, spider):
# modified_url=orginal_url_from_sitemap + 'myword'
# return request.replace(url = modified_url)
def parse(self, response):
print response.status, response.url
You can attach the request_scheduled signal to a function and do what you want in the function. For example
class MySpider(SitemapSpider):
@classmethod
def from_crawler(cls, crawler):
spider = cls()
crawler.signals.connect(spider.request_scheduled, signals.request_scheduled)
def request_scheduled(self, request, spider):
modified_url = orginal_url_from_sitemap + 'myword'
request.url = modified_url
SitemapSpider
has sitemap_filter
method.
You can override it to implement required functionality.
class MySpider(SitemapSpider):
...
def sitemap_filter(self, entries):
for entry in entries:
entry["loc"] = entry["loc"] + myword
yield entry
Each of that entry
objects are dicts with structure like this:
<class 'dict'>:
{'loc': 'https://example.com/',
'lastmod': '2019-01-04T08:09:23+00:00',
'changefreq': 'weekly',
'priority': '0.8'}
Important note!. SitemapSpider.sitemap_filter
method appeared on scrapy 1.6.0 released on Jan 2019 1.6.0 release notes - new extensibility features section
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With