This is Scrapy's default dupefilter and its request_seen method:
class RFPDupeFilter(BaseDupeFilter):

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
While implementing a custom dupefilter, I cannot retrieve the spider object from this class, unlike in other Scrapy middleware.
Is there any way I can know which spider object this is, so I can customize the filtering on a spider-by-spider basis?
Also, I cannot just implement a middleware that reads URLs, puts them into a list, and checks for duplicates instead of a custom dupefilter. This is because I need to pause/resume crawls and need Scrapy to keep storing the request fingerprints by default using the JOBDIR setting.
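For reference, the pause/resume workflow relies on the JOBDIR setting: when it points to a directory, Scrapy persists the scheduler queues and the dupefilter's requests.seen file there, and an interrupted crawl can be resumed by running the same crawl command again. A minimal sketch (the path is just an example, and the setting can equally be passed on the command line with -s JOBDIR=...):

# settings.py -- enable persistent crawl state (scheduler queues + requests.seen);
# 'crawls/myspider-1' is an example path, any writable directory works
JOBDIR = 'crawls/myspider-1'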
If you really want that, a solution is to override the request_seen method signature of RFPDupeFilter so that it receives two arguments (self, request, spider); then you also need to override the Scrapy Scheduler's enqueue_request method, because request_seen is called inside it. You can create a new scheduler and a new dupefilter like this:
# /scheduler.py
from scrapy.core.scheduler import Scheduler


class MyScheduler(Scheduler):

    def enqueue_request(self, request):
        # Same as the stock enqueue_request, except the spider is passed
        # along to the dupefilter.
        if not request.dont_filter and self.df.request_seen(request, self.spider):
            self.df.log(request, self.spider)
            return False
        dqok = self._dqpush(request)
        if dqok:
            self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
        else:
            self._mqpush(request)
            self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
        self.stats.inc_value('scheduler/enqueued', spider=self.spider)
        return True
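Note that self.spider needs no extra wiring: the stock Scheduler is opened with the spider when the crawl starts and keeps a reference to it on the instance (at least in the Scrapy versions this answer was written against), roughly like this simplified sketch:

# Illustrative sketch only, not the actual Scrapy source: the base Scheduler's
# open() receives the spider and stores it, which is what makes self.spider
# available inside enqueue_request above.
class SchedulerSketch:

    def open(self, spider):
        self.spider = spider  # later read by enqueue_request and the stats calls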
# /dupefilters.py
import os

from scrapy.dupefilters import RFPDupeFilter


class MyRFPDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        # Do things with spider here (see the per-spider sketch after the settings below)
and set their paths in settings.py:
# /settings.py
DUPEFILTER_CLASS = 'myproject.dupefilters.MyRFPDupeFilter'
SCHEDULER = 'myproject.scheduler.MyScheduler'
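Once the spider reaches request_seen, the per-spider behaviour can live on the spiders themselves. A minimal sketch, assuming a hypothetical skip_dupefilter attribute that you define on your own spiders (it is not a built-in Scrapy setting):

# /dupefilters.py -- sketch only; 'skip_dupefilter' is a made-up spider
# attribute used purely for illustration.
import os

from scrapy.dupefilters import RFPDupeFilter


class PerSpiderDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        # Spiders that declare skip_dupefilter = True are never filtered.
        if getattr(spider, 'skip_dupefilter', False):
            return False
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

A spider then only has to set skip_dupefilter = True in its class body to bypass the filter, while every other spider keeps the default persistent behaviour.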