Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance optimization on Django update or create

In a Django project, I'm refreshing tens of thousands of lines of data from an external API on a daily basis. The problem is that since I don't know if the data is new or just an update, I can't do a bulk_create operation.

Note: Some, or perhaps many, of the rows, do not actually change on a daily basis, but I don't which, or how many, ahead of time.

So for now I do:

for row in csv_data:
    try:
        MyModel.objects.update_or_create(id=row['id'], defaults={'field1': row['value1']....})
    except:
        print 'error!'

And it takes.... forever! One or two lines a second, max speed, sometimes several seconds per line. Each model I'm refreshing has one or more other models connected to it through a foreign key, so I can't just delete them all and reinsert every day. I can't wrap my head around this one -- how can I cut down significantly the number of database operations so the refresh doesn't take hours and hours.

Thanks for any help.

like image 264
ygesher Avatar asked Feb 10 '16 11:02

ygesher


People also ask

What tool would you use in Django to do benchmarking for your APIs?

- [Instructor] Apache Bench is used as a benchmarking tool to let you know how many requests your Apache web server can handle. However, it can be used to test any other web server too. In one terminal, we're going to be running the Django web server.

Why Django ORM is slow?

Django's ORM is fantastic. It's slow because it chooses to be convenient but if it needs to be fast it's just a few slight API calls away. If you're curious, check out the code on Github.

How to make a high-performance Django website?

To be able to have a high-performance Django website, you must follow many different best practices to improve many of the various aspects of building your application in Django. This usually mainly focuses on two things: Database Query Optimization Asset Bundling and Image Optimization Improving the Performance of This Website

How to optimize Django templates for better performance?

The Cache Template loader of Django is another weapon that can be used for optimization. You can enable Cache Template Loader and it will save all of your time and bandwidth to bring the same templates now and then. For instance, some templates are frequently used and if you enable Cache Template Loader, those templates will be saved.

Is Django slowing down your application?

Django is very easy to use and flexible. This flexibility slows down the application. In web-application development, performance optimization is an endless yet quintessential task. This not only helps in increasing the number of users but also helps in increasing the usability of the applications. Today, we will discuss Django Performance Tips.

Is there any Django ORM that could update_or_create?

If you have used get_or_create you must have had the thought of if there is any Django ORM that could update_or_create. Well, Django has the exact implementation which can either update the DB object if present, else it would create a new object.


2 Answers

The problem is you are doing a database action on each data row you grabbed from the api. You can avoid doing that by understanding which of the rows are new (and do a bulk insert to all new rows), Which of the rows actually need update, and which didn't change. To elaborate:

  1. grab all the relevant rows from the database (meaning all the rows that can possibly be updated)
old_data = MyModel.objects.all() # if possible than do MyModel.objects.filter(...)
  1. Grab all the api data you need to insert or update
api_data = [...]
  1. for each row of data understand if its new and put it in array, or determine if the row needs to update the DB
    for row in api_data:
        if is_new_row(row, old_data):
            new_rows_array.append(row)
        else:
            if is_data_modified(row, old_data):
                ...
                # do the update
            else:
                continue
     MyModel.objects.bulk_create(new_rows_array)

is_new_row - will understand if the row is new and add it to an array that will be bulk created

is_data_modified - will look for the row in the old data and understand if the data of that row is changed and will update only if its changed

like image 63
Yariv Katz Avatar answered Nov 03 '22 21:11

Yariv Katz


If you look at the source code for update_or_create(), you'll see that it's hitting the database multiple times for each call (either a get() followed by a save(), or a get() followed by a create()). It does things this way to maximize internal consistency - for example, this ensures that your model's save() method is called in either case.

But you might well be able to do better, depending on your specific models and the nature of your data. For example, if you don't have a custom save() method, aren't relying on signals, and know that most of your incoming data maps to existing rows, you could instead try an update() followed by a bulk_create() if the row doesn't exist. Leaving aside related models, that would result in one query in most cases, and two queries at the most. Something like:

updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])

(Note that this simplified example has a race condition, see the Django source for how to deal with it.)

In the future there will probably be support for PostgreSQL's UPSERT functionality, but of course that won't help you now.

Finally, as mentioned in the comment above, the slowness might just be a function of your database structure and not anything Django-specific.

like image 22
Kevin Christopher Henry Avatar answered Nov 03 '22 20:11

Kevin Christopher Henry