
scrapy filtering duplicate requests

Tags: python, scrapy

What is the difference between the Duplicate Filter which exists in the Scheduler and the IgnoreVisitedItems middleware?

This Google Groups thread suggests that the Scheduler already has a duplicate filter: http://groups.google.com/group/scrapy-users/browse_thread/thread/8e218bcc5b293532

Divick asked Dec 12 '22 05:12



1 Answer

The duplicate filter in the scheduler only filters out URLs already seen within a single spider run, so it is reset on every subsequent run. The IgnoreVisitedItems middleware keeps state between runs and avoids revisiting URLs seen in the past, but only for final item URLs, so the rest of the site can still be re-crawled to find new items.
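To make the distinction concrete, here is a minimal plain-Python sketch of the two behaviours. It does not use Scrapy and is not how either component is actually implemented; `RunDupeFilter`, `VisitedItemStore`, `simulate_run`, the URL-hash fingerprint, and the JSON file are all hypothetical names and choices made up for illustration.

```python
import hashlib
import json
import os
import tempfile


def fingerprint(url):
    """Hypothetical stand-in for a request fingerprint: a hash of the URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class RunDupeFilter:
    """Per-run filter: state lives in memory and is lost when the run ends."""

    def __init__(self):
        self.seen = set()

    def is_duplicate(self, url):
        fp = fingerprint(url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


class VisitedItemStore:
    """Cross-run store: remembers only item URLs, persisted to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.visited = set()
        if os.path.exists(path):
            with open(path) as f:
                self.visited = set(json.load(f))

    def should_skip(self, url, is_item_page):
        # Listing pages are never skipped, so new items can still be found.
        return is_item_page and fingerprint(url) in self.visited

    def mark_visited(self, url):
        self.visited.add(fingerprint(url))
        with open(self.path, "w") as f:
            json.dump(sorted(self.visited), f)


def simulate_run(store, requests):
    """Simulate one spider run and return the URLs actually fetched."""
    dupefilter = RunDupeFilter()  # created fresh for every run
    fetched = []
    for url, is_item_page in requests:
        if dupefilter.is_duplicate(url) or store.should_skip(url, is_item_page):
            continue
        fetched.append(url)
        if is_item_page:
            store.mark_visited(url)
    return fetched


if __name__ == "__main__":
    state_file = os.path.join(tempfile.mkdtemp(), "visited.json")
    requests = [
        ("http://example.com/list", False),
        ("http://example.com/item/1", True),
        ("http://example.com/list", False),  # duplicate within the same run
    ]
    print(simulate_run(VisitedItemStore(state_file), requests))
    # Second run: the listing page is fetched again, item/1 is skipped.
    print(simulate_run(VisitedItemStore(state_file), requests))
```

In this sketch the per-run filter is rebuilt on every call to `simulate_run`, while the item store reloads its state from disk, which is the difference described above.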

Pablo Hoffman answered Dec 29 '22 08:12
