This is the code I have written to scrape the Justdial website:
import scrapy
from scrapy.http.request import Request


class JustdialSpider(scrapy.Spider):
    name = 'justdial'
    # handle_httpstatus_list = [400]
    # headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    # handle_httpstatus_list = [403, 404]
    allowed_domains = ['justdial.com']
    start_urls = ['https://www.justdial.com/Delhi-NCR/Chemists/page-1']

    # def start_requests(self):
    #     headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
    #     for url in self.start_urls:
    #         self.log("I just visited :---------------------------------- " + url)
    #         yield Request(url, headers=headers)

    def parse(self, response):
        self.log("I just visited the site:---------------------------------------------- " + response.url)
        urls = response.xpath('//a/@href').extract()
        self.log("Urls-------: " + str(urls))
This is the error shown in the terminal:
2017-08-18 18:32:25 [scrapy.core.engine] INFO: Spider opened
2017-08-18 18:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-18 18:32:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in D:\scrapy\justdial\.scrapy\httpcache
2017-08-18 18:32:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/robots.txt> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/Delhi-NCR/Chemists/page-1> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.justdial.com/Delhi-NCR/Chemists/page-1>: HTTP status code is not handled or not allowed
I have seen similar questions on Stack Overflow and tried what they suggest. The commented-out lines in the code show what I tried:
Changing the User-Agent
Setting handle_httpstatus_list = [400]
Note: The site (https://www.justdial.com/Delhi-NCR/Chemists/page-1) is not blocked on my system; it opens normally in Chrome and Firefox. I get the same error with https://www.practo.com/bangalore#doctor-search as well.
As Avihoo Mamka mentioned in the comments, you need to send some extra request headers so that this website does not reject your requests. In this case it seems to be just the User-Agent header. By default, Scrapy identifies itself with the user agent "Scrapy/{version} (+http://scrapy.org)".
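If you want to check outside Scrapy whether the User-Agent really is the deciding header, something like the following rough sketch might help. It uses only the standard library; `build_request` and `fetch_status` are hypothetical helper names, and the "default" case here sends urllib's own default user agent, not Scrapy's, so it is only an approximation of what the spider sends.

```python
import urllib.error
import urllib.request

# Browser-like user agent string taken from the question's commented-out code.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
)


def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_status(request, timeout=10):
    """Return the HTTP status code, including error codes such as 403."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the status code is still available.
        return error.code


# Example usage (performs real network requests, so it is left commented out):
# url = "https://www.justdial.com/Delhi-NCR/Chemists/page-1"
# print("library default UA:", fetch_status(build_request(url)))
# print("browser-like UA:  ", fetch_status(build_request(url, BROWSER_UA)))
```

If the two calls return different status codes, the header is very likely what the site is filtering on. If both are rejected, other headers or cookies may be involved as well.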
When you set the user agent through the user_agent spider attribute, it starts to work. Setting request headers alone is probably not enough, as they seem to get overridden by the default user agent string. So set the spider attribute
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
(the same way you set start_urls) and try it.
As Tomáš Linhart mentioned, we have to add a USER_AGENT setting in settings.py, like this:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'