
Scrapy: how to stop a redirect (302)


I'm trying to crawl a URL using Scrapy, but it redirects me to a page that doesn't exist.

Redirecting (302) to <GET http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197> from <GET http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx> 

The problem is that http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx exists, but http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197 doesn't, so the crawler can't find the page. I've crawled many other websites without running into this problem. Is there a way I can stop this redirect?

Any help would be much appreciated. Thanks.

Update: This is my spider class

class Inon_Spider(BaseSpider):
    name = 'Inon'
    allowed_domains = ['www.shop.inonit.in']
    start_urls = ['http://www.shop.inonit.in/Products/Inonit-Gadget-Accessories-Mobile-Covers/-The-Red-Tag/Samsung-Note-2-Dead-Mau/pid-2656465.aspx']

    def parse(self, response):
        item = DealspiderItem()
        hxs = HtmlXPathSelector(response)

        title = hxs.select('//div[@class="aboutproduct"]/div[@class="container9"]/div[@class="ctl_aboutbrand"]/h1/text()').extract()
        price = hxs.select('//span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_spnWebPrice"]/span[@class="offer"]/span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_lblOfferPrice"]/text()').extract()
        prc = price[0].replace("Rs.  ","")
        description = []

        item['price'] = prc
        item['title'] = title
        item['description'] = description
        item['url'] = response.url

        return item
user_2000 asked Mar 18 '13 12:03




1 Answer

Yes, you can do this simply by adding a meta value to the request, like:

meta={'dont_redirect': True} 

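You can also have a particular response code delivered to your callback, like:

meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}

Here dont_redirect stops the redirect from being followed for that request, and handle_httpstatus_list lets the 302 response reach your callback instead of being filtered out as a non-2xx response. You can list as many HTTP status codes as you want to handle this way.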

Example:

yield Request(
    'some url',
    meta={
        'dont_redirect': True,
        'handle_httpstatus_list': [302],
    },
    callback=self.some_call_back,
)
akhter wahab answered Sep 23 '22 14:09
