
Ignore already visited URLs in Scrapy

Tags:

python

scrapy

This is my custom_filters.py file:

from scrapy.dupefilters import RFPDupeFilter  # note: "dupefilters" (plural) since Scrapy 1.0

class SeenURLFilter(RFPDupeFilter):
    """Duplicate filter that keys on the exact URL string."""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True
        self.urls_seen.add(request.url)
        return False

I added this line to settings.py:

   DUPEFILTER_CLASS = 'crawl_website.custom_filters.SeenURLFilter'
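The seen-set logic this filter relies on can be sketched and exercised without Scrapy at all; in this standalone version, FakeRequest is a hypothetical stand-in for scrapy.Request:

```python
# Standalone sketch of the seen-set logic behind SeenURLFilter.
# FakeRequest is a hypothetical stand-in for scrapy.Request.
class FakeRequest:
    def __init__(self, url):
        self.url = url

class SeenURLFilter:
    def __init__(self):
        self.urls_seen = set()

    def request_seen(self, request):
        # Returning True tells the scheduler to drop the request as a duplicate.
        if request.url in self.urls_seen:
            return True
        self.urls_seen.add(request.url)
        return False

f = SeenURLFilter()
print(f.request_seen(FakeRequest("http://example.com/a")))  # False: first visit
print(f.request_seen(FakeRequest("http://example.com/a")))  # True: duplicate
```

Note that a dupe filter only deduplicates *requests* before they are scheduled; it does not deduplicate *items* written to the feed export, which is why the CSV can still contain the same URL many times.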

When I check the generated CSV file, it still shows the same URL many times. What am I doing wrong?

blackmamba asked Nov 01 '22 07:11

1 Answer

From: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

Then in your settings.py add:

ITEM_PIPELINES = {
  'your_bot_name.pipelines.DuplicatesPipeline': 100
}

EDIT:

To check for duplicate URLs use:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.urls_seen.add(item['url'])
            return item

This requires a url = Field() in your item definition, something like this (items.py):

from scrapy.item import Item, Field

class PageItem(Item):
    url = Field()
    scraped_field_a = Field()
    scraped_field_b = Field()
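The pipeline's behaviour can be exercised without a running crawl; this standalone sketch uses plain dicts in place of PageItem instances and a stand-in DropItem exception (in a real project you would import it from scrapy.exceptions):

```python
# Standalone sketch of DuplicatesPipeline: duplicate URLs are dropped,
# first occurrences pass through. DropItem here is a stand-in exception.
class DropItem(Exception):
    pass

class DuplicatesPipeline:
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.urls_seen.add(item['url'])
        return item

pipeline = DuplicatesPipeline()
items = [
    {'url': 'http://example.com/a', 'scraped_field_a': 1},
    {'url': 'http://example.com/b', 'scraped_field_a': 2},
    {'url': 'http://example.com/a', 'scraped_field_a': 3},  # duplicate URL
]
kept = []
for it in items:
    try:
        kept.append(pipeline.process_item(it, spider=None))
    except DropItem:
        pass
print([it['url'] for it in kept])  # ['http://example.com/a', 'http://example.com/b']
```

Scrapy calls process_item once per scraped item; raising DropItem stops the duplicate before it reaches the feed export, so each URL appears only once in the CSV.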
mattes answered Nov 09 '22 10:11