Avoid Duplicate URL Crawling

Tags:

scrapy

I coded a simple crawler. In the settings.py file, referring to the Scrapy documentation, I used

DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'

If I stop the crawler and then restart it, it scrapes the duplicate URLs again. Am I doing something wrong?

asked Jul 15 '13 by user1787687

2 Answers

I believe what you are looking for is "persistence support", which lets you pause and resume crawls.

To enable it you can do:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

You can read more about it in the Scrapy documentation under "Jobs: pausing and resuming crawls".
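
If you prefer not to pass the flag on every run, the same thing can be configured in settings.py. A minimal sketch, where the directory name is just an example:

# settings.py
# With JOBDIR set, Scrapy persists its scheduler queue and the
# dupefilter's set of seen request fingerprints to this directory,
# so a resumed crawl skips URLs it has already visited.
JOBDIR = 'crawls/somespider-1'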

answered Nov 19 '22 by Jason Youk


According to the documentation, DUPEFILTER_CLASS is already set to scrapy.dupefilter.RFPDupeFilter by default.

RFPDupeFilter doesn't help if you stop the crawler: it only works within a single crawl, helping you avoid scraping duplicate URLs during that run.

It looks like you need to create your own custom filter based on RFPDupeFilter, as was done here: how to filter duplicate requests based on url in scrapy. If you want your filter to work across scrapy crawl sessions, you should keep the list of crawled URLs somewhere persistent, such as a database or a CSV file. A sketch of this approach is given below.
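
For illustration, here is a minimal sketch of such a filter. It assumes a Scrapy version where RFPDupeFilter lives in scrapy.dupefilters (the scrapy.dupefilter path in the question is from older releases); the module, class, and file names are invented for the example:

# myproject/dupefilters.py (hypothetical module)
import os

from scrapy.dupefilters import RFPDupeFilter


class PersistentDupeFilter(RFPDupeFilter):
    """Stores request fingerprints in a fixed file so that URLs
    crawled in earlier sessions are skipped after a restart."""

    SEEN_PATH = 'seen_requests.txt'  # example location, adjust as needed

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Reload fingerprints recorded by previous crawl sessions.
        if os.path.exists(self.SEEN_PATH):
            with open(self.SEEN_PATH) as f:
                self.fingerprints.update(line.rstrip() for line in f)
        self.seen_file = open(self.SEEN_PATH, 'a')

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True  # seen in this or an earlier session
        self.fingerprints.add(fp)
        self.seen_file.write(fp + '\n')
        return False

    def close(self, reason):
        self.seen_file.close()
        super().close(reason)

Then point the setting at it (adjust the dotted path to your project layout):

DUPEFILTER_CLASS = 'myproject.dupefilters.PersistentDupeFilter'

The same idea extends to a database table instead of a flat file.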

Hope that helps.

answered Nov 19 '22 by alecxe