How to update DjangoItem in Scrapy

I've been working with Scrapy but have run into a bit of a problem.

DjangoItem has a save method to persist items using the Django ORM. This is great, except that if I run a scraper multiple times, new items will be created in the database even though I may just want to update a previous value.

After looking at the documentation and source code, I don't see any means to update existing items.

I know that I could call out to the ORM to see if an item exists and update it, but that would mean one database query per object to look it up, and then a second one to save it.

How can I update items if they already exist?

NT3RP asked May 14 '14 at 19:05


1 Answer

Unfortunately, the best way I found to accomplish this is to do exactly what the question describes: check whether the item exists in the database using django_model.objects.get, then update it if it does.

In my settings file, I added the new pipeline:

ITEM_PIPELINES = {
    # ...
    # Last pipeline, because further changes won't be saved.
    'apps.scrapy.pipelines.ItemPersistencePipeline': 999
}

I created some helper functions to convert the item into its model, look up an existing record (falling back to the new model if none exists), and copy the scraped values onto it:

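def item_to_model(item):
    # A DjangoItem exposes its model class as `django_model`; default to None
    # so that other item types raise TypeError rather than AttributeError.
    model_class = getattr(item, 'django_model', None)
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")

    # `instance` is the unsaved model object built from the item's fields.
    return item.instance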


def get_or_create(model):
    model_class = type(model)
    created = False

    # Django's `get_or_create` is not used here: called with every field as
    # a lookup, it would match on all of them and create a new object any
    # time a value changed, rather than updating the existing object.
    #
    # Instead, we do the two steps separately.
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(name=model.name)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.

    return (obj, created)


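from django.forms.models import model_to_dict


def update_model(destination, source, commit=True):
    # Remember the primary key of the stored row before copying fields.
    pk = destination.pk

    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)

    # The copy may have overwritten the key with the unsaved source's value;
    # restore it so that save() updates the existing row.
    setattr(destination, 'pk', pk)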

    if commit:
        destination.save()

    return destination
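
The reason update_model saves and restores pk can be seen without Django. The following is a minimal sketch using plain dataclasses; Record and copy_fields are hypothetical stand-ins for a model and for update_model, and it assumes the dict of copied fields includes the primary key, which is unset on a freshly scraped object:

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a Django model instance."""
    name: str
    price: float
    pk: Optional[int] = None  # None means "not saved yet"


def copy_fields(destination: Record, source: Record) -> Record:
    """Copy every field from source onto destination, keeping destination's pk."""
    pk = destination.pk
    for key, value in asdict(source).items():
        setattr(destination, key, value)  # this also overwrites pk with None
    destination.pk = pk  # restore the identity of the stored row
    return destination


stored = Record(name="widget", price=9.99, pk=42)  # row already in the database
scraped = Record(name="widget", price=7.49)        # fresh item, no pk yet

updated = copy_fields(stored, scraped)
```

Without the final assignment, the stored object would end up with no primary key, and saving it would most likely insert a duplicate row instead of updating the existing one.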

Then, the final pipeline is fairly straightforward:

class ItemPersistencePipeline(object):
    def process_item(self, item, spider):
        try:
            item_model = item_to_model(item)
        except TypeError:
            # Not a DjangoItem: pass it through untouched.
            return item

        model, created = get_or_create(item_model)

        update_model(model, item_model)

        return item
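
This pipeline still issues one lookup query per item, which is the cost the question was worried about. One possible mitigation, sketched here under assumptions rather than taken from Scrapy or Django, is to load the existing name-to-pk mapping once and consult it in memory. KnownKeys and load_keys are hypothetical names; with Django, the loader could be something like `lambda: dict(Model.objects.values_list('name', 'pk'))`. It only makes sense if the set of keys fits in memory and nothing else writes to the table during the crawl.

```python
from typing import Callable, Dict, Hashable, Optional


class KnownKeys:
    """Caches a natural-key -> primary-key mapping, loaded lazily and only once."""

    def __init__(self, load_keys: Callable[[], Dict[Hashable, int]]):
        self._load_keys = load_keys
        self._keys: Optional[Dict[Hashable, int]] = None

    def _ensure_loaded(self) -> Dict[Hashable, int]:
        if self._keys is None:  # first use: query the database once
            self._keys = dict(self._load_keys())
        return self._keys

    def pk_for(self, key: Hashable) -> Optional[int]:
        """Return the stored pk for key, or None if the row is new."""
        return self._ensure_loaded().get(key)

    def remember(self, key: Hashable, pk: int) -> None:
        """Record a row created during this crawl so later items update it."""
        self._ensure_loaded()[key] = pk


# Demonstration with a fake loader that counts how often it is called.
calls = []


def fake_loader() -> Dict[Hashable, int]:
    calls.append(1)
    return {"widget": 42}


cache = KnownKeys(fake_loader)
existing_pk = cache.pk_for("widget")  # known row: update it
missing_pk = cache.pk_for("gadget")   # unknown row: create it
cache.remember("gadget", 43)
```

In process_item, a non-None result could be assigned as the pk of the item's model instance before saving, so that Django should attempt an UPDATE for that row rather than an INSERT; a None result means the object can be saved as new and then passed to remember.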

NT3RP answered Oct 15 '22 at 01:10