
How to disable cache in scrapy?

Tags: caching, scrapy

I am trying to crawl a webpage on a particular website. The page varies slightly depending on the set of cookies that I send through scrapy.Request().

If I make the requests one at a time, I get the correct result for each, but when I send the cookies in a for loop, every request gives me the same result. I think Scrapy is caching the response and serving the second request from that cache. Here is my code:

def start_requests(self):
    meta = {'REDIRECT_ENABLED': True}
    productUrl = "http://xyz"
    cookies = [{'name': '', 'value': '=='}, {'name': '', 'value': '=='}]
    for cook in cookies:
        header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"}
        productResponse = scrapy.Request(productUrl, callback=self.parseResponse, method='GET',
                                         meta=meta, body=str(), cookies=[cook],
                                         encoding='utf-8', priority=0, dont_filter=True)
        yield productResponse


def parseResponse(self, response):
    selector = Selector(response)
    print(selector.xpath("xpaths here").extract())
    yield None

I expect the print statement to give a different result for each of the two requests.

If anything isn't clear, please mention it in the comments.

sagar asked Sep 16 '15 17:09


2 Answers

The cache can be disabled in two ways:

  1. In the project's settings.py file, set HTTPCACHE_ENABLED = False.
  2. At runtime, override the setting on the command line: scrapy crawl crawl-name --set HTTPCACHE_ENABLED=False
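
As a minimal sketch of the first option, the fragment below shows what the relevant line in settings.py might look like. It assumes a standard Scrapy project in which settings.py is the active settings module; the rest of the file is left unchanged.

```python
# settings.py (fragment)
# Turn off Scrapy's HTTP cache for every spider in this project,
# so each request is sent to the site instead of being read from disk.
HTTPCACHE_ENABLED = False
```

Values passed with --set on the command line take precedence over settings.py, so the second option is handy for a one-off run without editing the file.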
Niranjan Sagar answered Oct 03 '22 14:10


Here I assume that you want to avoid caching only specific requests.

In this example, that means the requests issued from start_requests are not cached, while all other requests (for instance, any you issue from parseResponse) are.

To do this, add the line productResponse.meta['dont_cache'] = True to your code and set HTTPCACHE_ENABLED = True in settings.py.

All other requests will still be cached.

def start_requests(self):
    meta = {'REDIRECT_ENABLED': True}
    productUrl = "http://xyz"
    cookies = [{'name': '', 'value': '=='}, {'name': '', 'value': '=='}]
    for cook in cookies:
        header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"}
        productResponse = scrapy.Request(productUrl, callback=self.parseResponse, method='GET',
                                         meta=meta, body=str(), cookies=[cook],
                                         encoding='utf-8', priority=0, dont_filter=True)
        # Tell the HTTP cache middleware to skip this particular request
        productResponse.meta['dont_cache'] = True
        yield productResponse


def parseResponse(self, response):
    selector = Selector(response)
    print(selector.xpath("xpaths here").extract())
    yield None
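
A possible variation, sketched here without importing Scrapy so it can be run on its own: the dont_cache flag can be placed in the meta dictionary before the request is built, instead of being set on the request afterwards. The helper name uncached_request_kwargs is hypothetical and not part of Scrapy; it only prepares keyword arguments that could then be unpacked into scrapy.Request.

```python
def uncached_request_kwargs(url, cookies_list, base_meta=None):
    """Yield one dict of scrapy.Request keyword arguments per cookie.

    Hypothetical helper (not a Scrapy API): each dict carries its own
    copy of the meta mapping with 'dont_cache' set to True.
    """
    for cookie in cookies_list:
        meta = dict(base_meta or {})  # copy so requests do not share state
        meta["dont_cache"] = True     # ask the HTTP cache to skip this request
        yield {
            "url": url,
            "cookies": [cookie],
            "meta": meta,
            "dont_filter": True,      # same URL is requested more than once
        }


# Possible use inside a spider's start_requests (assumes scrapy is imported):
#     for kwargs in uncached_request_kwargs(productUrl, cookies, meta):
#         yield scrapy.Request(callback=self.parseResponse, **kwargs)
```

Building a fresh meta dictionary per request keeps the flag local to the requests that need it, so other requests in the spider are still cached when HTTPCACHE_ENABLED is True.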
Levon answered Oct 03 '22 14:10