This is my custom_filters.py file:

from scrapy.dupefilter import RFPDupeFilter  # renamed to scrapy.dupefilters in Scrapy >= 1.0

class SeenURLFilter(RFPDupeFilter):
    """Dupe filter that compares plain URLs instead of request fingerprints."""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True
        self.urls_seen.add(request.url)
        return False
I added the line:

DUPEFILTER_CLASS = 'crawl_website.custom_filters.SeenURLFilter'

to settings.py.

When I check the generated CSV file, it still lists the same URL many times. Am I doing something wrong?
A dupe filter only discards duplicate requests before they are downloaded; it does not stop duplicate items from reaching your CSV export. Deduplicate items in an item pipeline instead. From the Scrapy docs: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
Then in your settings.py add:

ITEM_PIPELINES = {
    'your_bot_name.pipelines.DuplicatesPipeline': 100,
}
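Note that process_item assumes every item carries an id field; if an item lacks it, the item['id'] lookup raises a KeyError. A minimal items.py sketch for illustration (ProductItem and its fields are placeholder names):

from scrapy.item import Item, Field

class ProductItem(Item):
    id = Field()    # unique key the pipeline deduplicates on
    name = Field()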
EDIT:
To filter out duplicate URLs instead, use:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.urls_seen.add(item['url'])
            return item
This requires a url = Field() in your item, something like this (items.py):
from scrapy.item import Item, Field

class PageItem(Item):
    url = Field()
    scraped_field_a = Field()
    scraped_field_b = Field()
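Your spider must then set url on every item it yields, otherwise the pipeline's item['url'] lookup raises a KeyError. A minimal sketch, assuming Scrapy 1.0+ and the project layout above (spider name, start URL, and XPaths are placeholders):

import scrapy
from your_bot_name.items import PageItem

class PageSpider(scrapy.Spider):
    name = 'pages'
    start_urls = ['http://example.com']

    def parse(self, response):
        item = PageItem()
        item['url'] = response.url  # the field DuplicatesPipeline deduplicates on
        item['scraped_field_a'] = response.xpath('//title/text()').extract_first()
        item['scraped_field_b'] = response.xpath('//h1/text()').extract_first()
        yield item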