scrapy filtering duplicate requests

Question

What is the difference between the Duplicate Filter which exists in the Scheduler and the IgnoreVisitedItems middleware?

This Google Groups thread suggests that there is a duplicate filter in the Scheduler: http://groups.google.com/group/scrapy-users/browse_thread/thread/8e218bcc5b293532

Pablo Hoffman · Accepted Answer

The duplicate filter in the scheduler only filters out URLs already seen within a single spider run, so it is reset on each subsequent run. The IgnoreVisitedItems middleware keeps state between runs and avoids visiting URLs seen in the past, but only for final item URLs, so that the rest of the site can still be re-crawled in order to find new items.
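To make the distinction concrete, here is a minimal sketch in plain Python. It is not Scrapy's actual code: the class names `RunScopedDupeFilter` and `PersistentItemFilter`, the `simulate_run` helper, and the JSON state file are all hypothetical and only illustrate the two behaviours described above (an in-memory filter that is forgotten after each run, and a stored set of item URLs that survives between runs).

```python
import json
import os
import tempfile


class RunScopedDupeFilter:
    """Hypothetical stand-in for the scheduler's filter: memory only, lost after the run."""

    def __init__(self):
        self.seen = set()

    def should_skip(self, url):
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


class PersistentItemFilter:
    """Hypothetical stand-in for the middleware: item URLs are saved to a file."""

    def __init__(self, path):
        self.path = path
        self.visited = set()
        if os.path.exists(path):
            with open(path) as f:
                self.visited = set(json.load(f))

    def should_skip(self, url, is_item_page):
        # Only final item pages are skipped; listing pages are always
        # re-crawled so that new items can be discovered.
        return is_item_page and url in self.visited

    def mark_visited(self, url):
        self.visited.add(url)
        with open(self.path, "w") as f:
            json.dump(sorted(self.visited), f)


def simulate_run(state_path, requests):
    """Simulate one spider run over (url, is_item_page) pairs; return the URLs fetched."""
    dupes = RunScopedDupeFilter()              # fresh on every run
    items = PersistentItemFilter(state_path)   # reloaded from disk on every run
    fetched = []
    for url, is_item_page in requests:
        if dupes.should_skip(url) or items.should_skip(url, is_item_page):
            continue
        fetched.append(url)
        if is_item_page:
            items.mark_visited(url)
    return fetched


if __name__ == "__main__":
    state = os.path.join(tempfile.mkdtemp(), "visited.json")
    first = [("/list", False), ("/item/1", True), ("/list", False)]
    print(simulate_run(state, first))                        # ['/list', '/item/1']
    print(simulate_run(state, first + [("/item/2", True)]))  # ['/list', '/item/2']
```

In the second run the listing page is fetched again because the in-memory filter started empty, while `/item/1` is skipped because it was recorded on disk during the first run.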

Tags: python, scrapy

Divick
