I've been working with Scrapy but have run into a bit of a problem.
`DjangoItem` has a `save` method to persist items using the Django ORM. This is great, except that if I run a scraper multiple times, new items will be created in the database even though I may just want to update a previous value.
After looking at the documentation and source code, I don't see any means to update existing items.
I know that I could call out to the ORM to see if an item exists and update it, but it would mean calling out to the database for every single object and then again to save the item.
How can I update items if they already exist?
Unfortunately, the best way that I found to accomplish this is to do exactly what was stated: check if the item exists in the database using `django_model.objects.get`, then update it if it does.
In my settings file, I added the new pipeline:
ITEM_PIPELINES = {
    # ...
    # Last pipeline, because further changes won't be saved.
    'apps.scrapy.pipelines.ItemPersistencePipeline': 999
}
I created some helper methods to handle the work of building the model from the item, looking up an existing row, and falling back to a new one if necessary:
from django.forms.models import model_to_dict


def item_to_model(item):
    # Default to None so that a non-DjangoItem raises the TypeError below
    # instead of an AttributeError from getattr.
    model_class = getattr(item, 'django_model', None)
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")

    return item.instance


def get_or_create(model):
    model_class = type(model)
    created = False

    # Normally, we would use `get_or_create`. However, `get_or_create` called
    # with all of the item's fields would match on every property (i.e. create
    # a new object anytime one changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately.
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(name=model.name)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.

    return (obj, created)


def update_model(destination, source, commit=True):
    pk = destination.pk

    # Copy every field value from the freshly scraped model.
    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)

    # Restore the primary key, which the loop may have overwritten.
    setattr(destination, 'pk', pk)

    if commit:
        destination.save()

    return destination
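To see how the lookup and update steps cooperate without a database, here is a minimal sketch that swaps the Django ORM for an in-memory dict keyed by name. `Record`, `InMemoryStore`, `update_record`, and `persist` are all hypothetical stand-ins invented for this illustration; they are not part of Scrapy or Django.

```python
class Record:
    """Hypothetical stand-in for a Django model instance."""

    def __init__(self, name, price, pk=None):
        self.pk = pk
        self.name = name
        self.price = price


class InMemoryStore:
    """Hypothetical stand-in for the database table, keyed by name."""

    def __init__(self):
        self.rows = {}
        self.next_pk = 1

    def get_or_new(self, scraped):
        # Mirrors get_or_create: look up by name, fall back to the scraped record.
        existing = self.rows.get(scraped.name)
        if existing is not None:
            return existing, False
        return scraped, True

    def save(self, record):
        # Assign a primary key on first save, like an auto-increment column.
        if record.pk is None:
            record.pk = self.next_pk
            self.next_pk += 1
        self.rows[record.name] = record


def update_record(destination, source):
    # Mirrors update_model: copy every field, then restore the original pk.
    pk = destination.pk
    for key, value in vars(source).items():
        setattr(destination, key, value)
    destination.pk = pk
    return destination


def persist(store, scraped):
    record, created = store.get_or_new(scraped)
    update_record(record, scraped)
    store.save(record)
    return record, created


store = InMemoryStore()
first, first_created = persist(store, Record("widget", 10))
second, second_created = persist(store, Record("widget", 12))
print(len(store.rows), second.pk, second.price, first_created, second_created)
# prints: 1 1 12 True False
```

Running the same name through twice leaves a single row whose primary key is unchanged and whose price reflects the second scrape, which is the behaviour the helpers are meant to give against the real ORM.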
Then, the final pipeline is fairly straightforward:
class ItemPersistencePipeline(object):
    def process_item(self, item, spider):
        try:
            item_model = item_to_model(item)
        except TypeError:
            return item

        model, created = get_or_create(item_model)

        update_model(model, item_model)

        return item
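As a possible alternative, Django's queryset API also provides `update_or_create`, which takes lookup keyword arguments plus a `defaults` dict and returns an `(object, created)` tuple. The sketch below is only an assumption about how it could replace the two helpers; `upsert_by_name` is a hypothetical name, and it assumes the scraped values arrive as a plain dict containing a `name` key. As far as I know, it still performs a lookup followed by a save, so it tidies the code rather than removing the extra query.

```python
def upsert_by_name(model_class, fields):
    """Update the row whose name matches, or create it.

    `upsert_by_name` is a hypothetical helper; `fields` is assumed to be a
    plain dict of scraped values that includes a "name" key.
    """
    # Everything except the lookup key is applied as an update.
    defaults = {key: value for key, value in fields.items() if key != "name"}
    # Django's update_or_create returns an (object, created) tuple.
    return model_class.objects.update_or_create(
        name=fields["name"], defaults=defaults
    )
```

Inside `process_item`, this could be called as `upsert_by_name(item.django_model, dict(item))`, assuming every key in the item maps directly to a model field.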