This is Scrapy's default dupefilter and its request_seen method:
class RFPDupeFilter(BaseDupeFilter):

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
While implementing a custom dupefilter, I cannot retrieve the spider object from this class, unlike in other Scrapy middleware.
Is there any way I can know which spider object this is, so I can customize the filtering on a spider-by-spider basis?
Also, I cannot just implement a middleware that reads URLs, puts them into a list, and checks for duplicates instead of a custom dupefilter. This is because I need to pause/resume crawls and need Scrapy to keep storing the request fingerprints by default using the JOBDIR setting.
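For reference, the pause/resume workflow relies on the JOBDIR setting: when it points to a directory, Scrapy persists the scheduler queues and the dupefilter's requests.seen file there, and an interrupted crawl can be resumed by running the same crawl command again. A minimal sketch (the path is just an example, and the setting can equally be passed on the command line with -s JOBDIR=...):

# settings.py -- enable persistent crawl state (scheduler queues + requests.seen);
# 'crawls/myspider-1' is an example path, any writable directory works
JOBDIR = 'crawls/myspider-1'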
If you really want that, a solution is to override the request_seen method signature of RFPDupeFilter so that it receives two arguments (self, request, spider); then you also need to override the Scrapy Scheduler's enqueue_request method, because request_seen is called inside it. You can create a new scheduler and a new dupefilter like this:
# /scheduler.py
from scrapy.core.scheduler import Scheduler


class MyScheduler(Scheduler):

    def enqueue_request(self, request):
        # Same as the stock enqueue_request, except the spider is passed
        # along to the dupefilter.
        if not request.dont_filter and self.df.request_seen(request, self.spider):
            self.df.log(request, self.spider)
            return False
        dqok = self._dqpush(request)
        if dqok:
            self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
        else:
            self._mqpush(request)
            self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
        self.stats.inc_value('scheduler/enqueued', spider=self.spider)
        return True
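Note that self.spider needs no extra wiring: the stock Scheduler is opened with the spider when the crawl starts and keeps a reference to it on the instance (at least in the Scrapy versions this answer was written against), roughly like this simplified sketch:

# Illustrative sketch only, not the actual Scrapy source: the base Scheduler's
# open() receives the spider and stores it, which is what makes self.spider
# available inside enqueue_request above.
class SchedulerSketch:

    def open(self, spider):
        self.spider = spider  # later read by enqueue_request and the stats calls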
# /dupefilters.py
import os

from scrapy.dupefilters import RFPDupeFilter


class MyRFPDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        # Do things with spider here (see the per-spider sketch after the settings below)
and set their paths in settings.py:
# /settings.py
DUPEFILTER_CLASS = 'myproject.dupefilters.MyRFPDupeFilter'
SCHEDULER = 'myproject.scheduler.MyScheduler'
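Once the spider reaches request_seen, the per-spider behaviour can live on the spiders themselves. A minimal sketch, assuming a hypothetical skip_dupefilter attribute that you define on your own spiders (it is not a built-in Scrapy setting):

# /dupefilters.py -- sketch only; 'skip_dupefilter' is a made-up spider
# attribute used purely for illustration.
import os

from scrapy.dupefilters import RFPDupeFilter


class PerSpiderDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        # Spiders that declare skip_dupefilter = True are never filtered.
        if getattr(spider, 'skip_dupefilter', False):
            return False
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

A spider then only has to set skip_dupefilter = True in its class body to bypass the filter, while every other spider keeps the default persistent behaviour.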