I am using Scrapy to crawl several websites, which may share redundant information.
For each page I scrape, I store the URL of the page, its title, and its HTML code in MongoDB.
I want to avoid duplication in the database, so I implemented a pipeline that checks whether a similar item is already stored. If one is, I raise a DropItem
exception.
My problem is that whenever I drop an item by raising a DropItem
exception, Scrapy displays the entire content of the item in the log (stdout or file).
Since I'm extracting the entire HTML code of each scraped page, the whole HTML code ends up in the log whenever an item is dropped.
How could I silently drop an item without its content being shown?
Thank you for your time!
from scrapy import log
from scrapy.exceptions import DropItem


class DatabaseStorage(object):
    """ Pipeline in charge of database storage.
    The 'whole' item (with HTML and text) will be stored in mongoDB.
    """

    def __init__(self):
        # MongoConnector is our own helper around the MongoDB driver
        self.mongo = MongoConnector().collection

    def process_item(self, item, spider):
        """ Method in charge of item validation and processing. """
        if item['html'] and item['title'] and item['url']:
            # insert the item in mongo if it is not already present
            if self.mongo.find_one({'title': item['title']}):
                raise DropItem('Item already in db')
            else:
                self.mongo.insert(dict(item))
                log.msg("Item %s scraped" % item['title'],
                        level=log.INFO, spider=spider)
        else:
            raise DropItem('Missing information on item %s' % (
                'scraped from ' + (item.get('url') or item.get('title'))))
        return item
The proper way to do this looks to be to implement a custom LogFormatter for your project and change the logging level of dropped items.
Example:
from scrapy import log
from scrapy import logformatter


class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': log.DEBUG,
            'format': logformatter.DROPPEDFMT,
            'exception': exception,
            'item': item,
        }
Then in your settings file, something like:
LOG_FORMATTER = 'apps.crawler.spiders.PoliteLogFormatter'
I had bad luck just returning None, which caused exceptions in future pipelines.
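A minimal sketch of why that happens (the pipeline names here are hypothetical): Scrapy feeds whatever one pipeline's process_item returns into the next pipeline, so a None return becomes the next pipeline's item:

class DropsSilently(object):
    def process_item(self, item, spider):
        return None  # the next pipeline now receives None instead of the item

class ExplodesLater(object):
    def process_item(self, item, spider):
        # TypeError: 'NoneType' object is not subscriptable
        return item['title']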
In recent Scrapy versions, this has been changed a bit. I copied the code from @jimmytheleaf and fixed it to work with recent Scrapy:
import logging

from scrapy import logformatter


class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': logging.INFO,
            'msg': logformatter.DROPPEDMSG,
            'args': {
                'exception': exception,
                'item': item,
            }
        }
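As with the original answer, this only takes effect once the formatter is registered in your settings file; the module path below is just an example of where the class might live in your project:

LOG_FORMATTER = 'myproject.logformatters.PoliteLogFormatter'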
OK, I found the answer before even posting the question. I still think it might be valuable to anyone who runs into the same problem.
Instead of dropping the item with a DropItem
exception, you just have to return None:
def process_item(self, item, spider):
    """ Method in charge of item validation and processing. """
    if item['html'] and item['title'] and item['url']:
        # insert the item in mongo if it is not already present
        if self.mongo.find_one({'url': item['url']}):
            return  # swallow the duplicate without logging anything
        else:
            self.mongo.insert(dict(item))
            log.msg("Item %s scraped" % item['title'],
                    level=log.INFO, spider=spider)
    else:
        raise DropItem('Missing information on item %s' % (
            'scraped from ' + (item.get('url') or item.get('title'))))
    return item