I am trying to crawl a webpage on a particular website. The page varies slightly depending on the set of cookies I send through scrapy.Request().
If I make the requests one by one, I get the correct results, but when I send these cookies in a for loop, I get the same result every time. I think Scrapy is caching the response and serving the second request from that cache. Here is my code:
import scrapy
from scrapy.selector import Selector

def start_requests(self):
    meta = {'REDIRECT_ENABLED': True}
    productUrl = "http://xyz"
    cookies = [{'name': '', 'value': '=='}, {'name': '', 'value': '=='}]
    for cook in cookies:
        header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"}
        productResponse = scrapy.Request(productUrl, callback=self.parseResponse, method='GET',
                                         headers=header, meta=meta, body=str(), cookies=[cook],
                                         encoding='utf-8', priority=0, dont_filter=True)
        yield productResponse

def parseResponse(self, response):
    selector = Selector(response)
    print(selector.xpath("xpaths here").extract())
    yield None
I expect the print statement to give different results for the two requests.
If anything isn't clear, please mention it in the comments.
The cache can be disabled in two ways: globally, by leaving HTTPCACHE_ENABLED at its default of False in settings.py, or per request, via the dont_cache meta key. Here I assume that you only want to avoid caching specific requests: in this example, the requests made in start_requests, while still caching every other request (such as any you may yield from parseResponse).
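For completeness, the global switch is a single line in settings.py; a minimal sketch (False is also Scrapy's default, so the HTTP cache is off unless you turn it on):

# settings.py -- global switch for Scrapy's HttpCacheMiddleware
HTTPCACHE_ENABLED = False   # False (the default) disables the HTTP cache entirely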
For the per-request approach, add the line productResponse.meta['dont_cache'] = True to your code and set HTTPCACHE_ENABLED = True in settings.py. All other requests will then be cached.
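A minimal settings.py sketch for that; the setting names are Scrapy's own, and the values beyond HTTPCACHE_ENABLED are just illustrative:

# settings.py -- enable the HTTP cache; requests carrying the
# dont_cache meta key will bypass it
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'     # stored under the project's .scrapy directory

The spider code with the dont_cache line added then looks like this: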
import scrapy
from scrapy.selector import Selector

def start_requests(self):
    meta = {'REDIRECT_ENABLED': True}
    productUrl = "http://xyz"
    cookies = [{'name': '', 'value': '=='}, {'name': '', 'value': '=='}]
    for cook in cookies:
        header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"}
        productResponse = scrapy.Request(productUrl, callback=self.parseResponse, method='GET',
                                         headers=header, meta=meta, body=str(), cookies=[cook],
                                         encoding='utf-8', priority=0, dont_filter=True)
        # Skip the HTTP cache for this request only
        productResponse.meta['dont_cache'] = True
        yield productResponse

def parseResponse(self, response):
    selector = Selector(response)
    print(selector.xpath("xpaths here").extract())
    yield None
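Equivalently, you can put the flag straight into the meta dict you already build, which saves the extra line after constructing the Request. A small sketch, reusing the same productUrl, header and cookies names as above:

for cook in cookies:
    # 'dont_cache' in meta tells HttpCacheMiddleware to skip this request
    meta = {'REDIRECT_ENABLED': True, 'dont_cache': True}
    yield scrapy.Request(productUrl, callback=self.parseResponse, method='GET',
                         headers=header, meta=meta, cookies=[cook], dont_filter=True)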