
Using Middleware to ignore duplicates in Scrapy

Tags: python, scrapy

I'm a beginner in Python, and I'm using Scrapy for a personal web project.

I use Scrapy to extract data from several websites repeatedly, so I need to check on every crawl whether a link is already in the database before adding it. I did this in a pipelines.py class:

from scrapy.exceptions import DropItem
import memcache

memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)

class DuplicatesPipeline(object):
    def process_item(self, item, spider):
        # Only pass the item through if its link is not already cached
        if memc2.get(item['link']) is None:
            return item
        else:
            raise DropItem('Duplicate item found: %s' % item['link'])
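For the pipeline to run at all, it also has to be enabled in the project settings. A minimal sketch, assuming the project is named myproject (as in the middleware settings further down) and the class above lives in pipelines.py:

ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,
}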

But I heard that using middleware is better for this task, since a downloader middleware can skip a duplicate link before the page is even downloaded.

I found it a little hard to use middleware in Scrapy; can anyone please point me to a good tutorial?

Any advice is welcome.

Thanks,

Edit:

I'm using MySQL and memcache.

Here is my attempt, based on @Talvalin's answer:

# -*- coding: utf-8 -*-

from scrapy.exceptions import IgnoreRequest
import MySQLdb as mdb
import memcache

connexion = mdb.connect('localhost','dev','passe','mydb')
memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)

class IgnoreDuplicates():

    def __init__(self):
        #clear memcache object
        memc2.flush_all()

        #update memc2
        with connexion:
            cur = connexion.cursor()
            cur.execute('SELECT link, title FROM items')
            for item in cur.fetchall():
                memc2.set(item[0], item[1])

    def precess_request(self, request, spider):
        #if the url is not in memc2 keys, it returns None.
        if memc2.get(request.url) is None:
            return None
        else:
            raise IgnoreRequest()

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDuplicates': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 500,
}

But it seems that the process_request method is ignored when crawling.

Thanks in advance,

asked Apr 09 '14 by elhoucine


1 Answer

Here's some example middleware code that loads urls from a sqlite3 table (Id INT, url TEXT) into a set, then checks request urls against that set to determine whether a url should be ignored. It should be reasonably straightforward to adapt this code to use MySQL and memcache (a sketch of that adaptation follows the code below), but please let me know if you have any issues or questions. :)

import sqlite3
from scrapy.exceptions import IgnoreRequest

class IgnoreDuplicates():

    def __init__(self):
        self.crawled_urls = set()

        # Load all previously crawled urls into an in-memory set
        with sqlite3.connect(r'C:\dev\scrapy.db') as conn:
            cur = conn.cursor()
            cur.execute("""SELECT url FROM CrawledURLs""")
            self.crawled_urls.update(x[0] for x in cur.fetchall())

        print self.crawled_urls  # debug output

    def process_request(self, request, spider):
        # Raising IgnoreRequest drops the request before it is downloaded
        if request.url in self.crawled_urls:
            raise IgnoreRequest()
        else:
            return None
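For reference, here is a minimal sketch of the MySQL/memcache adaptation mentioned above. It reuses the connection details, the items(link, title) table, and the memc2 client from the question's edit, and is untested, so treat it as a starting point rather than a drop-in solution:

import MySQLdb as mdb
import memcache
from scrapy.exceptions import IgnoreRequest

class IgnoreDuplicates():

    def __init__(self):
        # Connection details as given in the question's edit
        conn = mdb.connect('localhost', 'dev', 'passe', 'mydb')
        self.memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)

        # Rebuild the cache from MySQL so lookups hit memory, not the database
        self.memc2.flush_all()
        with conn:
            cur = conn.cursor()
            cur.execute('SELECT link, title FROM items')
            for link, title in cur.fetchall():
                self.memc2.set(link, title)

    def process_request(self, request, spider):
        # A cache hit means the link was already crawled, so skip the download
        if self.memc2.get(request.url) is not None:
            raise IgnoreRequest()
        return None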

On the off-chance you have import issues like me and are about to punch your monitor, the code above was in a middlewares.py file placed in the top-level project folder, with the following DOWNLOADER_MIDDLEWARES setting:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDuplicates': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 500,
}
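One note on the priority values: Scrapy calls process_request on downloader middlewares in ascending priority order, so with the settings above HttpProxyMiddleware (500) runs before IgnoreDuplicates (543). Any number that doesn't clash with the built-in middlewares will work.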
answered Sep 16 '22 by Talvalin