Error 403 : HTTP status code is not handled or not allowed in scrapy

Tags: python, http, scrapy

This is the code I have written to scrape the Justdial website.

import scrapy
from scrapy.http.request import Request

class JustdialSpider(scrapy.Spider):
    name = 'justdial'
    # handle_httpstatus_list = [400]
    # headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    # handle_httpstatus_list = [403, 404]
    allowed_domains = ['justdial.com']
    start_urls = ['https://www.justdial.com/Delhi-NCR/Chemists/page-1']
    # def start_requests(self):
    #     headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
    #     for url in self.start_urls:
    #         self.log("I just visited :---------------------------------- " + url)
    #         yield Request(url, headers=headers)

    def parse(self, response):
        self.log("I just visited the site:---------------------------------------------- " + response.url)
        urls = response.xpath('//a/@href').extract()
        self.log("Urls-------: " + str(urls))

This is the error shown in the terminal:

2017-08-18 18:32:25 [scrapy.core.engine] INFO: Spider opened
2017-08-18 18:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-18 18:32:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in D:\scrapy\justdial\.scrapy\httpcache
2017-08-18 18:32:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/robots.txt> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/Delhi-NCR/Chemists/page-1> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.justdial.com/Delhi-NCR/Chemists/page-1>: HTTP status code is not handled or not allowed

I have seen similar questions on Stack Overflow and tried everything they suggest. The commented-out lines in the code show what I tried:

Changing the User-Agent
Setting handle_httpstatus_list = [400]

Note: The site (https://www.justdial.com/Delhi-NCR/Chemists/page-1) is not blocked on my system; it opens fine in Chrome and Mozilla Firefox. I get the same error with https://www.practo.com/bangalore#doctor-search as well.

asked Aug 18 '17 13:08

Raguram Gopi

2 Answers

It starts to work when you set the user agent through the user_agent spider attribute. Setting request headers alone is probably not enough, as they seem to get overridden by the default user agent string. So set the spider attribute

user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"

(the same way you set start_urls) and try it.

answered Sep 20 '22 11:09

Tomáš Linhart

As Tomáš Linhart mentioned, we have to add a USER_AGENT setting in settings.py, like this:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'
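
A slightly fuller settings.py sketch follows. The USER_AGENT line is the fix from this answer; the HTTPCACHE_IGNORE_HTTP_CODES line is an addition based on an assumption: the log in the question marks both 403 responses as ['cached'], so with the HTTP cache enabled and its default policy, Scrapy may keep replaying the stored 403 even after the user agent changes. Entries already stored under the project's .scrapy/httpcache directory would likely have to be deleted once by hand.

```python
# settings.py (sketch)

# Browser-like user agent sent with every request in the project.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
)

# Assumption: the HTTP cache is enabled in this project (the log shows
# cached responses). Do not store 403 responses, so a blocked request
# is retried against the live site on the next run.
HTTPCACHE_IGNORE_HTTP_CODES = [403]
```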

answered Sep 19 '22 11:09

Raguram Gopi
