
Scrapy - Silently drop an item

Tags: python, scrapy

I am using Scrapy to crawl several websites, which may share redundant information.

For each page I scrape, I store the page's URL, title, and HTML code in MongoDB. To avoid duplicates in the database, I implemented a pipeline that checks whether a similar item is already stored; if so, I raise a DropItem exception.

My problem is that whenever I drop an item by raising a DropItem exception, Scrapy displays the entire content of the item in the log (stdout or file). Since I extract the full HTML code of each scraped page, whenever an item is dropped, the whole HTML code ends up in the log.

How could I silently drop an item without its content being shown?

Thank you for your time!

class DatabaseStorage(object):
    """ Pipeline in charge of database storage.

    The 'whole' item (with HTML and text) will be stored in mongoDB.
    """

    def __init__(self):
        self.mongo = MongoConnector().collection

    def process_item(self, item, spider):
        """ Method in charge of item validation and processing. """
        if item['html'] and item['title'] and item['url']:
            # insert item in mongo if not already present
            if self.mongo.find_one({'title': item['title']}):
                raise DropItem('Item already in db')
            else:
                self.mongo.insert(dict(item))
                log.msg("Item %s scraped" % item['title'],
                    level=log.INFO, spider=spider)
        else:
            # parenthesize the fallback -- '+' binds tighter than 'or'
            raise DropItem('Missing information on item %s' % (
                'scraped from ' + (item.get('url') or item.get('title') or '?')))
        return item
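As an aside, combining string concatenation with an `or` fallback, as in the DropItem message above, is precedence-sensitive: `+` binds tighter than `or`, so the title fallback can only fire if the concatenated string is falsy, which it never is. A quick standalone check (plain Python, no Scrapy needed):

```python
# '+' binds tighter than 'or', so the unparenthesized expression groups as
# ('scraped from ' + url) or title -- the title fallback is unreachable.
url, title = '', 'My page title'

unparenthesized = 'scraped from ' + url or title
parenthesized = 'scraped from ' + (url or title)

print(unparenthesized)  # 'scraped from ' -- title was silently ignored
print(parenthesized)    # 'scraped from My page title'
```

And if `item.get('url')` returns None rather than an empty string, the unparenthesized version raises a TypeError instead of falling back to the title.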
Balthazar Rouberol asked Nov 23 '12

3 Answers

The proper way to do this looks to be implementing a custom LogFormatter for your project and changing the logging level of dropped items.

Example:

from scrapy import log
from scrapy import logformatter

class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': log.DEBUG,
            'format': logformatter.DROPPEDFMT,
            'exception': exception,
            'item': item,
        }

Then in your settings file, something like:

LOG_FORMATTER = 'apps.crawler.spiders.PoliteLogFormatter'

I had bad luck just returning None, which caused exceptions in later pipelines.
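Those exceptions arise because Scrapy feeds each pipeline's return value into the next pipeline as its input item. A rough stand-in (plain Python, hypothetical pipeline names, no Scrapy required) for why a bare `return None` breaks the chain:

```python
# Each pipeline's return value becomes the next pipeline's input item,
# so returning None instead of the item starves downstream pipelines.
def dedupe_pipeline(item):
    return None  # "silently dropping" by returning nothing

def enrich_pipeline(item):
    # crashes with a TypeError when handed None instead of a dict
    return {**item, 'length': len(item['html'])}

item = {'html': '<html>...</html>'}
try:
    enrich_pipeline(dedupe_pipeline(item))
except TypeError as exc:
    print('downstream pipeline failed:', exc)
```

This is why raising DropItem (and quieting the log formatter) is cleaner than swallowing the item with a plain return.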

jimmytheleaf answered Oct 18 '22

In recent Scrapy versions, this has been changed a bit. I copied the code from @jimmytheleaf and fixed it to work with recent Scrapy:

import logging
from scrapy import logformatter


class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': logging.INFO,
            'msg': logformatter.DROPPEDMSG,
            'args': {
                'exception': exception,
                'item': item,
            }
        }
mirosval answered Oct 18 '22

Ok, I found the answer before even posting the question. I still think it might be valuable to anyone having the same problem.

Instead of dropping the item with a DropItem exception, you can simply return None:

def process_item(self, item, spider):
    """ Method in charge of item validation and processing. """
    if item['html'] and item['title'] and item['url']:
        # insert item in mongo if not already present
        if self.mongo.find_one({'url': item['url']}):
            # returning None drops the item without logging its content
            return
        else:
            self.mongo.insert(dict(item))
            log.msg("Item %s scraped" % item['title'],
                level=log.INFO, spider=spider)
    else:
        # parenthesize the fallback -- '+' binds tighter than 'or'
        raise DropItem('Missing information on item %s' % (
            'scraped from ' + (item.get('url') or item.get('title') or '?')))
    return item
Balthazar Rouberol answered Oct 18 '22