I'm a beginner in Python, and I'm using Scrapy for a personal web project.
I use Scrapy to extract data from several websites repeatedly, so I need to check on every crawl whether a link is already in the database before adding it. I did this in a pipelines.py class:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def process_item(self, item, spider):
        # memc2 is the memcache client used for the duplicate check
        if memc2.get(item['link']) is None:
            return item
        else:
            raise DropItem('Duplicate link: %s' % item['link'])
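For context, this check only helps across crawls because another pipeline stores each new link in MySQL and memcache after it passes. A rough sketch of that storage step (the pipeline name and INSERT statement here are just illustrative, my real code differs) looks like this:

import MySQLdb as mdb
import memcache

connexion = mdb.connect('localhost', 'dev', 'passe', 'mydb')
memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)

class MySQLStorePipeline(object):
    def process_item(self, item, spider):
        # Persist the new link so the next crawl can detect it as a duplicate
        cur = connexion.cursor()
        cur.execute('INSERT INTO items (link, title) VALUES (%s, %s)',
                    (item['link'], item['title']))
        connexion.commit()
        # Mirror it into memcache so the duplicate check stays fast
        memc2.set(item['link'], item['title'])
        return item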
But I heard that using a middleware is better suited for this task.
I found middleware a little hard to use in Scrapy; can anyone please point me to a good tutorial?
Any advice is welcome.
Thanks,
Edit:
I'm using MySQL and memcache.
Here is my attempt based on @Talvalin's answer:
# -*- coding: utf-8 -*-
from scrapy.exceptions import IgnoreRequest
import MySQLdb as mdb
import memcache

connexion = mdb.connect('localhost', 'dev', 'passe', 'mydb')
memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)

class IgnoreDuplicates():

    def __init__(self):
        # clear the memcache object
        memc2.flush_all()
        # repopulate memc2 from the database
        with connexion:
            cur = connexion.cursor()
            cur.execute('SELECT link, title FROM items')
            for item in cur.fetchall():
                memc2.set(item[0], item[1])

    def precess_request(self, request, spider):
        # if the url is not in the memc2 keys, get() returns None
        if memc2.get(request.url) is None:
            return None
        else:
            raise IgnoreRequest()
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDuplicates': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 500,
}
But it seems that the process_request method is ignored when crawling.
Thanks in advance.
Here's some example middleware code that loads urls from a sqlite3 table (Id INT, url TEXT) into a set, and then checks request urls against the set to determine whether the url should be ignored. It should be reasonably straightforward to adapt this code to use MySQL and memcache (a rough sketch of that adaptation follows the example below), but please let me know if you have any issues or questions. :)
import sqlite3
from scrapy.exceptions import IgnoreRequest

class IgnoreDuplicates():

    def __init__(self):
        self.crawled_urls = set()
        with sqlite3.connect(r'C:\dev\scrapy.db') as conn:
            cur = conn.cursor()
            cur.execute("""SELECT url FROM CrawledURLs""")
            self.crawled_urls.update(x[0] for x in cur.fetchall())
        # Debug output: show which urls were loaded
        print self.crawled_urls

    def process_request(self, request, spider):
        # Ignore the request if its url has already been crawled
        if request.url in self.crawled_urls:
            raise IgnoreRequest()
        else:
            return None
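For example, a rough adaptation to MySQL and memcache might look like the following. It reuses the connection settings, memcached address, and items table from your question, but it's only a sketch under those assumptions, not tested code:

import MySQLdb as mdb
import memcache
from scrapy.exceptions import IgnoreRequest

class IgnoreDuplicates(object):

    def __init__(self):
        # Reuses the MySQL credentials and memcached address from the question
        connexion = mdb.connect('localhost', 'dev', 'passe', 'mydb')
        memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)
        memc2.flush_all()
        # Cache every known link -> title pair so lookups avoid hitting MySQL
        cur = connexion.cursor()
        cur.execute('SELECT link, title FROM items')
        for link, title in cur.fetchall():
            memc2.set(link, title)
        connexion.close()
        self.memc2 = memc2

    def process_request(self, request, spider):
        # A cache hit means the link was already stored on a previous crawl
        if self.memc2.get(request.url) is not None:
            raise IgnoreRequest()
        return None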
On the off-chance you have import issues like I did and are about to punch your monitor, the code above was in a middlewares.py file placed in the top-level project folder, with the following DOWNLOADER_MIDDLEWARES setting:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDuplicates': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 500,
}