I have a small problem printing the redirected URLs (the new URLs after a 301 redirect) when scraping a given website. My idea is to only print them, not scrape them. My current piece of code is:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'rust'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        # Extract all links and parse them with parse_item;
        # follow=True keeps the crawl going from those pages.
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        #if response.status == 301:
        print(response.url)
However, this does not print the redirected URLs. Any help will be appreciated.
Thank you.
To parse responses with a status other than 200, you'd need to do one of these things:
You can set HTTPERROR_ALLOWED_CODES = [301, 302, ...] in your settings.py file. Or, if you want to enable it for all codes, you can set HTTPERROR_ALLOW_ALL = True instead. Keep in mind that 301 responses are normally consumed by RedirectMiddleware (a downloader middleware) before they ever reach the spider, so with this settings-only approach you also need REDIRECT_ENABLED = False (or the dont_redirect meta key) for the 301 to actually arrive at your callback.
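A minimal settings.py sketch along those lines (assuming you want to see the 301 responses themselves rather than have Scrapy follow them):

# settings.py
# Let HttpErrorMiddleware pass 301/302 responses through to the spider.
HTTPERROR_ALLOWED_CODES = [301, 302]

# Stop RedirectMiddleware from following the redirect before the
# response reaches the spider; otherwise the callback only ever
# sees the final (post-redirect) page.
REDIRECT_ENABLED = False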
Add a handle_httpstatus_list attribute to your spider. In your case something like:

class MySpider(scrapy.Spider):
    handle_httpstatus_list = [301]

(Note that handle_httpstatus_all works as a request meta key or via the HTTPERROR_ALLOW_ALL setting, not as a spider attribute. A handy side effect of handle_httpstatus_list is that RedirectMiddleware checks it too, so listing 301 there both stops the redirect from being followed and hands the 301 response to your callback.)
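Putting that together with your spider, a minimal runnable sketch (the redirect target is read from the Location header of the 301 response; example.com is a placeholder):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'rust'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    # Pass 301 responses to the callback instead of letting
    # RedirectMiddleware follow them silently.
    handle_httpstatus_list = [301]

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        if response.status == 301:
            # The new URL is in the Location header (bytes in Scrapy).
            location = response.headers.get('Location', b'').decode()
            print(response.url, '->', location)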
You can set these meta keys on individual requests: handle_httpstatus_list = [301, 302, ...], or handle_httpstatus_all = True for all codes:

scrapy.Request('http://url.com', meta={'handle_httpstatus_list': [301]})
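For example, in a start_requests method (a sketch; the spider name and URL are placeholders):

import scrapy

class RedirectPrinter(scrapy.Spider):
    name = 'redirect_printer'

    def start_requests(self):
        # The meta key applies per request, so other requests in the
        # same spider keep the default redirect behaviour.
        yield scrapy.Request(
            'http://example.com',
            meta={'handle_httpstatus_list': [301, 302]},
            callback=self.parse,
        )

    def parse(self, response):
        if response.status in (301, 302):
            print(response.headers.get('Location', b'').decode())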
To learn more, see the HttpErrorMiddleware documentation.