
Django batching/bulk update_or_create?

I have data in the database which needs updating periodically. The source of the data returns everything that's available at that point in time, so it will include new data that is not already in the database.

As I loop through the source data I don't want to be making 1000s of individual writes if possible.

Is there anything like update_or_create that works in batches?

One thought was using update_or_create in combination with manual transactions, but I'm not sure if that just queues up the individual writes or if it would combine it all into one SQL insert?

Or similarly, could using @commit_on_success() on a function with update_or_create inside the loop work?

I am not doing anything with the data other than translating it and saving it to a model. Nothing is dependent on that model existing during the loop.

asked Nov 20 '14 at 19:11 by binarysmacker



1 Answer

Since Django added support for bulk_update (in Django 2.2), this is now mostly possible, though it takes three database calls per batch: a get, a bulk create, and a bulk update. It's a bit challenging to design a good interface for a general-purpose function here, since the function has to support efficient querying as well as the updates. Here is a method I implemented for bulk update_or_create where you have a number of common identifying keys (which could be empty) and one identifying key that varies within the batch.

This is implemented as a classmethod on a base model, but can be used independently of that. It also assumes that the base model has an auto_now timestamp field named updated_on; if this is not the case, the lines of code that assume it have been commented for easy modification.

To use this in batches, split your updates into chunks before calling it. The same approach handles data whose secondary identifier takes one of a small number of values: group the updates by that identifier and make one call per group, without having to change the interface.
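Before the Django method itself, the split behind the three calls can be illustrated in plain Python. This is only a sketch: `plan_upsert` is a hypothetical helper, and the list of existing keys stands in for the result of the single lookup query.

```python
def plan_upsert(existing_keys, incoming):
    """Split `incoming` ({key: defaults}) into rows to create and rows to update.

    `existing_keys` stands in for the keys returned by the one lookup query.
    """
    existing = set(existing_keys)
    to_create = {k: v for k, v in incoming.items() if k not in existing}
    to_update = {k: v for k, v in incoming.items() if k in existing}
    return to_create, to_update


# Toy data: 1234 is already stored, 2345 is new.
to_create, to_update = plan_upsert(
    existing_keys=[1234],
    incoming={1234: {"started": True}, 2345: {"started": False}},
)
```

In the real method, `to_create` corresponds to the objects handed to bulk_create and `to_update` to the objects handed to bulk_update.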

    from django.db import models, transaction
    from django.utils import timezone


    class BaseModel(models.Model):
        updated_on = models.DateTimeField(auto_now=True)

        class Meta:
            # Abstract, so each subclass gets its own table and bulk_create() works
            abstract = True

        @classmethod
        def bulk_update_or_create(cls, common_keys, unique_key_name, unique_key_to_defaults):
            """
            common_keys: {field_name: field_value}
            unique_key_name: field_name
            unique_key_to_defaults: {field_value: {field_name: field_value}}

            ex. Event.bulk_update_or_create(
                {"organization": organization}, "external_id", {1234: {"started": True}}
            )
            """
            with transaction.atomic():
                # Call 1: fetch and lock the rows that already exist
                filter_kwargs = dict(common_keys)
                filter_kwargs[f"{unique_key_name}__in"] = list(unique_key_to_defaults)
                existing_objs = {
                    getattr(obj, unique_key_name): obj
                    for obj in cls.objects.filter(**filter_kwargs).select_for_update()
                }

                # Call 2: create the missing rows (without mutating the caller's dicts)
                creates = [
                    cls(**{**defaults, unique_key_name: key, **common_keys})
                    for key, defaults in unique_key_to_defaults.items()
                    if key not in existing_objs
                ]
                if creates:
                    cls.objects.bulk_create(creates)

                # Call 3: update the rows that already exist
                # This set should contain the name of the `auto_now` field of the model
                update_fields = {"updated_on"}
                now = timezone.now()
                updates = []
                for key, obj in existing_objs.items():
                    obj.update(unique_key_to_defaults[key], save=False)
                    # bulk_update() does not call save(), so set the timestamp by hand
                    obj.updated_on = now
                    update_fields.update(unique_key_to_defaults[key].keys())
                    updates.append(obj)
                if updates:
                    cls.objects.bulk_update(updates, list(update_fields))
            return len(creates), len(updates)

        def update(self, update_dict=None, save=True, **kwargs):
            """Helper method to update objects"""
            if not update_dict:
                update_dict = kwargs
            # This set should contain the name of the `auto_now` field of the model
            update_fields = {"updated_on"}
            for k, v in update_dict.items():
                setattr(self, k, v)
                update_fields.add(k)
            if save:
                self.save(update_fields=update_fields)

Example usage:

    class Event(BaseModel):
        # on_delete is required since Django 2.0; CASCADE is an assumption here
        organization = models.ForeignKey(Organization, on_delete=models.CASCADE)
        external_id = models.IntegerField()
        started = models.BooleanField()


    organization = Organization.objects.get(...)
    updates_by_external_id = {
        1234: {"started": True},
        2345: {"started": True},
        3456: {"started": False},
    }
    Event.bulk_update_or_create(
        {"organization": organization}, "external_id", updates_by_external_id
    )
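To run this in batches, as suggested earlier, the mapping can be split before each call. The following is a minimal sketch in plain Python; `chunk_dict` and the batch size of 2 are hypothetical choices for illustration, not part of Django or of the method above.

```python
from itertools import islice


def chunk_dict(mapping, size):
    """Yield successive dicts, each holding at most `size` items of `mapping`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    items = iter(mapping.items())
    while True:
        chunk = dict(islice(items, size))
        if not chunk:
            return
        yield chunk


# Same shape as the third argument of bulk_update_or_create
updates_by_external_id = {
    1234: {"started": True},
    2345: {"started": True},
    3456: {"started": False},
}

# Hypothetical batch size; tune it for your database and row width
batches = list(chunk_dict(updates_by_external_id, size=2))
```

Each element of `batches` can then be passed as the third argument to `Event.bulk_update_or_create`, so the cost is three queries per batch rather than separate queries for every row.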
answered Oct 12 '22 at 20:10 by Zags
