I am using Scrapy to crawl several websites, which may share redundant information.
For each page I scrape, I store the URL of the page, its title, and its HTML code in MongoDB.
I want to avoid duplication in the database, so I implemented a pipeline that checks whether a similar item is already stored. If one is, I raise a DropItem
exception.
My problem is that whenever I drop an item by raising a DropItem
exception, Scrapy displays the entire content of the item in the log (stdout or file).
Since I'm extracting the entire HTML code of each scraped page, the whole HTML code ends up in the log whenever an item is dropped.
How could I silently drop an item without its content being shown?
Thank you for your time!
from scrapy import log
from scrapy.exceptions import DropItem


class DatabaseStorage(object):
    """ Pipeline in charge of database storage.
    The 'whole' item (with HTML and text) will be stored in mongoDB.
    """

    def __init__(self):
        # MongoConnector is our own helper around the MongoDB driver
        self.mongo = MongoConnector().collection

    def process_item(self, item, spider):
        """ Method in charge of item validation and processing. """
        if item['html'] and item['title'] and item['url']:
            # insert the item in mongo if it is not already present
            if self.mongo.find_one({'title': item['title']}):
                raise DropItem('Item already in db')
            else:
                self.mongo.insert(dict(item))
                log.msg("Item %s scraped" % item['title'],
                        level=log.INFO, spider=spider)
        else:
            raise DropItem('Missing information on item %s' % (
                'scraped from ' + (item.get('url') or item.get('title'))))
        return item
The proper way to do this looks to be to implement a custom LogFormatter for your project and change the logging level of dropped items.
Example:
from scrapy import log
from scrapy import logformatter


class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': log.DEBUG,
            'format': logformatter.DROPPEDFMT,
            'exception': exception,
            'item': item,
        }
Then in your settings file, something like:
LOG_FORMATTER = 'apps.crawler.spiders.PoliteLogFormatter'
I had bad luck just returning None, which caused exceptions in future pipelines.
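A minimal sketch of why that happens (the pipeline names here are hypothetical): Scrapy feeds whatever one pipeline's process_item returns into the next pipeline, so a None return becomes the next pipeline's item:

class DropsSilently(object):
    def process_item(self, item, spider):
        return None  # the next pipeline now receives None instead of the item

class ExplodesLater(object):
    def process_item(self, item, spider):
        # TypeError: 'NoneType' object is not subscriptable
        return item['title']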
In recent Scrapy versions, this has been changed a bit. I copied the code from @jimmytheleaf and fixed it to work with recent Scrapy:
import logging

from scrapy import logformatter


class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': logging.INFO,
            'msg': logformatter.DROPPEDMSG,
            'args': {
                'exception': exception,
                'item': item,
            }
        }
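As with the original answer, this only takes effect once the formatter is registered in your settings file; the module path below is just an example of where the class might live in your project:

LOG_FORMATTER = 'myproject.logformatters.PoliteLogFormatter'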
OK, I found the answer before even posting the question. I still think it might be valuable to anyone who runs into the same problem.
Instead of dropping the item with a DropItem
exception, you just have to return None:
def process_item(self, item, spider):
    """ Method in charge of item validation and processing. """
    if item['html'] and item['title'] and item['url']:
        # insert the item in mongo if it is not already present
        if self.mongo.find_one({'url': item['url']}):
            return  # swallow the duplicate without logging anything
        else:
            self.mongo.insert(dict(item))
            log.msg("Item %s scraped" % item['title'],
                    level=log.INFO, spider=spider)
    else:
        raise DropItem('Missing information on item %s' % (
            'scraped from ' + (item.get('url') or item.get('title'))))
    return item