Using middleware to prevent scrapy from double-visiting websites

I have a problem like this:

how to filter duplicate requests based on url in scrapy

So, I do not want a website to be crawled more than once. I adapted the middleware and wrote a print statement to test whether it correctly classifies already seen websites. It does.

Nonetheless, the parsing seems to be executed multiple times, because the JSON file I receive contains duplicate entries.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

from crawlspider.items import KickstarterItem

from HTMLParser import HTMLParser

### code for stripping off HTML tags:
class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return str(''.join(self.fed))

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
###

items = []

class MySpider(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ['http://www.kickstarter.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('discover/categories/comics', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('projects/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = KickstarterItem()

        item['date'] = hxs.select('//*[@id="about"]/div[2]/ul/li[1]/text()').extract()
        item['projname'] = hxs.select('//*[@id="title"]/a').extract()
        item['projname'] = strip_tags(str(item['projname']))

        item['projauthor'] = hxs.select('//*[@id="name"]')
        item['projauthor'] = item['projauthor'].select('string()').extract()[0]

        item['backers'] = hxs.select('//*[@id="backers_count"]/data').extract()
        item['backers'] = strip_tags(str(item['backers']))

        item['collmoney'] = hxs.select('//*[@id="pledged"]/data').extract()
        item['collmoney'] = strip_tags(str(item['collmoney']))

        item['goalmoney'] = hxs.select('//*[@id="stats"]/h5[2]/text()').extract()
        items.append(item)
        return items

My items.py looks like this:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class KickstarterItem(Item):
    # define the fields for your item here like:
    date = Field()
    projname = Field()
    projauthor = Field()
    backers = Field()
    collmoney = Field()
    goalmoney = Field()
    pass

My middleware looks like this:

import os

from scrapy.dupefilter import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class CustomFilter(RFPDupeFilter):
    def __getid(self, url):
        mm = url.split("/")[4] # extracts project-id (is a number) from project-URL
        print "_____________", mm
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        self.fingerprints.add(fp)
        if fp in self.fingerprints and fp.isdigit(): # .isdigit() checks whether fp comes from a project ID
            print "______fp is a number (therefore a project-id) and has been encountered before______"
            return True
        if self.file:
            self.file.write(fp + os.linesep)

I added this line to settings.py:

DUPEFILTER_CLASS = 'crawlspider.duplicate_filter.CustomFilter'

I call the script using "scrapy crawl kickstarter -o items.json -t json". Then I see the correct print statements from the middleware code. Any comments on why the JSON contains multiple entries with the same data?

asked Feb 02 '13 by Damian


1 Answer

So now these are the three modifications that removed the duplicates:

I added this to settings.py:

ITEM_PIPELINES = ['crawlspider.pipelines.DuplicatesPipeline',]

This lets Scrapy know about the DuplicatesPipeline class I added in pipelines.py:

from scrapy import signals
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['projname'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['projname'])
            return item
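
As a quick sanity check (not part of the original post), the pipeline can be exercised on its own. The import paths below assume the project layout from the ITEM_PIPELINES setting above, and 'Some Project' is just a placeholder value:

from scrapy.exceptions import DropItem

from crawlspider.items import KickstarterItem
from crawlspider.pipelines import DuplicatesPipeline

pipeline = DuplicatesPipeline()
item = KickstarterItem()
item['projname'] = 'Some Project'   # placeholder project name

pipeline.process_item(item, spider=None)        # first occurrence passes through
try:
    pipeline.process_item(item, spider=None)    # second occurrence raises DropItem
except DropItem as exc:
    print(exc)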

You do not need to adjust the spider, and you do not need the dupefilter/middleware code I posted before.

But I have the feeling that my solution doesn't reduce the communication, since each Item object still has to be created before it is evaluated and possibly dropped. But I am okay with that.
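
For completeness: if you did want to filter at the request level, so that duplicate project pages are never downloaded at all, a dupefilter along the lines of the one in the question could check the fingerprint before adding it (the CustomFilter above adds it first, so the membership test always succeeds). This is only a sketch based on the question's code, not part of the original answer:

import os

from scrapy.dupefilter import RFPDupeFilter

class ProjectIdDupeFilter(RFPDupeFilter):
    # hypothetical variant of the CustomFilter from the question
    def __getid(self, url):
        return url.split("/")[4]  # project-id segment of the project URL

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if not fp.isdigit():
            # not a project URL: fall back to the default request fingerprinting
            return super(ProjectIdDupeFilter, self).request_seen(request)
        if fp in self.fingerprints:
            # this project id was already scheduled once: skip the request
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)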

(Solution found by asker, moved into an answer)

answered Nov 07 '22 by Jason S