What is the difference between the Duplicate Filter which exists in the Scheduler and the IgnoreVisitedItems middleware?
Google group thread which suggests that there is a duplicate filter present in the Scheduler: http://groups.google.com/group/scrapy-users/browse_thread/thread/8e218bcc5b293532
The duplicate filter in the scheduler only filters out URLs already seen within a single spider run, so its state is reset on every subsequent run. The IgnoreVisitedItems middleware keeps state between runs and avoids revisiting URLs seen in the past, but only for final item URLs, so that the rest of the site can still be re-crawled in order to find new items.
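To make the distinction concrete, here is a minimal sketch in plain Python. It does not use Scrapy's actual classes; `InMemoryDupeFilter`, `PersistentItemFilter`, `fingerprint` and `run_crawl` are hypothetical names invented for illustration, and the JSON state file is an assumption about how state might be persisted between runs.

```python
import hashlib
import json
import os
import tempfile


def fingerprint(url):
    """Hypothetical stand-in for a request fingerprint: a hash of the URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class InMemoryDupeFilter:
    """Scheduler-style filter (illustrative): state lives for one run only."""

    def __init__(self):
        self.seen = set()

    def should_skip(self, url):
        fp = fingerprint(url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


class PersistentItemFilter:
    """Middleware-style filter (illustrative): remembers item URLs across runs."""

    def __init__(self, path):
        self.path = path
        self.visited = set()
        # Reload whatever earlier runs recorded (assumed JSON state file).
        if os.path.exists(path):
            with open(path) as f:
                self.visited = set(json.load(f))

    def should_skip(self, url):
        return fingerprint(url) in self.visited

    def mark_item_scraped(self, url):
        self.visited.add(fingerprint(url))

    def close(self):
        with open(self.path, "w") as f:
            json.dump(sorted(self.visited), f)


def run_crawl(urls, item_urls, state_path):
    """Simulate one spider run and return the URLs actually fetched."""
    dupefilter = InMemoryDupeFilter()            # fresh on every run
    item_filter = PersistentItemFilter(state_path)
    fetched = []
    for url in urls:
        # Only final item URLs are checked against the persistent state.
        if url in item_urls and item_filter.should_skip(url):
            continue
        # Every URL is checked against the per-run duplicate filter.
        if dupefilter.should_skip(url):
            continue
        fetched.append(url)
        if url in item_urls:
            item_filter.mark_item_scraped(url)
    item_filter.close()
    return fetched


if __name__ == "__main__":
    state = os.path.join(tempfile.mkdtemp(), "visited.json")
    urls = ["http://example.com/list", "http://example.com/item/1"]
    items = {"http://example.com/item/1"}
    print(run_crawl(urls, items, state))  # first run
    print(run_crawl(urls, items, state))  # second run, same state file
```

Under these assumptions, the second run fetches the listing page again (the in-memory filter started empty) but skips the item page, because its fingerprint was saved by the first run.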