 

Scrapy - Retrieve spider object in dupefilter

Tags:

python

scrapy

This is the request_seen method of Scrapy's default dupefilter class:

# excerpt from scrapy/dupefilters.py -- note that request_seen receives no spider argument
class RFPDupeFilter(BaseDupeFilter):

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

While implementing a custom dupefilter, I cannot retrieve the spider object from this class, unlike in other Scrapy middlewares.
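For example, a downloader middleware's process_request method receives the spider as an argument, while the dupefilter's request_seen does not (a rough sketch of the two signatures; the class names here are just placeholders):

class SomeDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # the spider instance is available here
        pass


class SomeDupeFilter(RFPDupeFilter):

    def request_seen(self, request):
        # no spider argument available here
        pass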

Is there any way I can know which spider object this is, so I can customize the dupefilter on a spider-by-spider basis?

Also, I cannot just implement a middleware that reads URLs, puts them into a list, and checks for duplicates instead of using a custom dupefilter. This is because I need to pause/resume crawls and need Scrapy to store the request fingerprints on disk by default using the JOBDIR setting.
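For reference, I pause/resume crawls as described in the Scrapy docs, e.g. (myspider is just a placeholder name):

scrapy crawl myspider -s JOBDIR=crawls/myspider-1

which is what makes Scrapy persist the seen-request fingerprints to disk via the default dupefilter.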

wolfgang asked Sep 10 '15 06:09



1 Answer

If you really want that, one solution is to override the request_seen method of RFPDupeFilter and change its signature so that it receives two arguments (self, request, spider); then you also need to override Scrapy's Scheduler.enqueue_request method, because request_seen is called inside it. You can create a new scheduler and a new dupefilter like this:

# /scheduler.py

from scrapy.core.scheduler import Scheduler


class MyScheduler(Scheduler):

    def enqueue_request(self, request):
        # identical to the stock Scheduler.enqueue_request, except that the
        # spider (stored as self.spider by the base Scheduler's open() method)
        # is passed along to the dupefilter
        if not request.dont_filter and self.df.request_seen(request, self.spider):
            self.df.log(request, self.spider)
            return False
        dqok = self._dqpush(request)
        if dqok:
            self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
        else:
            self._mqpush(request)
            self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
        self.stats.inc_value('scheduler/enqueued', spider=self.spider)
        return True


# /dupefilters.py

import os

from scrapy.dupefilters import RFPDupeFilter


class MyRFPDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        # same logic as the stock request_seen, but the spider is now available
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

        # Do things with spider

and set their paths in settings.py:

# /settings.py

DUPEFILTER_CLASS = 'myproject.dupefilters.MyRFPDupeFilter'
SCHEDULER = 'myproject.scheduler.MyScheduler' 
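With the spider passed into request_seen, the filter can then be customized per spider. A minimal sketch of what "Do things with spider" could look like, assuming a made-up skip_dedupe spider attribute (not a real Scrapy setting):

# /dupefilters.py (sketch of a possible per-spider customization)

import os

from scrapy.dupefilters import RFPDupeFilter


class PerSpiderDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        # hypothetical opt-out: a spider that sets skip_dedupe = True
        # never has its requests filtered as duplicates
        if getattr(spider, 'skip_dedupe', False):
            return False
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

To use this variant instead, DUPEFILTER_CLASS would point at it, e.g. 'myproject.dupefilters.PerSpiderDupeFilter'.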
sergiuz answered Oct 21 '22 23:10