Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

django select_related - when to use it

I'm trying to optimize my ORM queries in django. I use connection.queries to view the queries that django generate for me.

Assuming I have these models:

class Book(models.Model):
    name   = models.CharField(max_length=50)
    author = models.ForeignKey(Author)

class Author(models.Model):
    name   = models.CharField(max_length=50)

Let's say, when I generate a specific webpage, I want to display all books, with their author name next to each of them. Also, I display seperately all the authors.

So should I use

Book.objects.all().select_related("author")

Which will result in a JOIN query. Even if I do a line before:

Author.objects.all()

Obviously in template I will write something like {{book.author.name}}.
So the question is, when I access a foreign key value (author), if django already has that object from another query, will that still result in additional query (for each book)? If no, so in that case, does using select_related actually creates performance overhead?

like image 803
user3599803 Avatar asked Oct 20 '15 07:10

user3599803


People also ask

Why we use Select_related in Django?

In Django, select_related and prefetch_related are designed to stop the deluge of database queries that are caused by accessing related objects. In this article, we will see how it reduces the number of queries and make the program much faster.

What's the difference between Prefetch_related and Select_related used to optimize number of DB queries?

The difference is that select_related does an SQL join and therefore gets the results back as part of the table from the SQL server. prefetch_related on the other hand executes another query and therefore reduces the redundant columns in the original object ( ModelA in the above example).

What is select related and Prefetch_related?

select_related and prefetch_related are ways to optimize database querying patterns when accessing related items. Both works on the same principle of prefetching data from the database, however, they take different approaches to do so. We will be using a decorator to print all the stats for the query.

How do I use Django ORM?

To do so, open the Django shell to run the query. You might be wonder how Django ORM makes our queries executed or what the corresponding query of the code we are writing. It is quite simple to get the SQL query, we need to use the str() and pass the queryset object along with query.


3 Answers

You are actually asking two different questions:

1. does using select_related actually creates performance overhead?

You should see documentation about Django Query Cache:

Understand QuerySet evaluation

To avoid performance problems, it is important to understand:

  • that QuerySets are lazy.

  • when they are evaluated.

  • how the data is held in memory.

So in summary, Django caches in memory results evaluated within the same QuerySet object, that is, if you do something like that:

books = Book.objects.all().select_related("author")
for book in books:
    print(book.author.name)  # Evaluates the query set, caches in memory results
first_book = books[1]  # Does not hit db
print(first_book.author.name)  # Does not hit db  

Will only hit db once as you prefetched Authors in select_related, all this stuff will result in a single database query with INNER JOIN.

BUT this won't do any cache between querysets, nor even with the same query:

books = Book.objects.all().select_related("author")
books2 = Book.objects.all().select_related("author")
first_book = books[1]  # Does hit db
first_book = books2[1]  # Does hit db

This is actually pointed out in docs:

We will assume you have done the obvious things above. The rest of this document focuses on how to use Django in such a way that you are not doing unnecessary work. This document also does not address other optimization techniques that apply to all expensive operations, such as general purpose caching.

2. if django already has that object from another query, will that still result in additional query (for each book)?

You are actually meaning if Django does ORM queries caching, which is a very different matter. ORM Queries caching, that is, if you do a query before and then you do the same query later, if database hasn't changed, the result is coming from a cache and not from an expensive database lookup.

The answer is not Django, not officially supported, but yes unofficially, yes through 3rd-party apps. The most relevant third-party apps that enables this type of caching are:

  1. Johnny-Cache (older, not supporting django>1.6)
  2. Django-Cachalot (newer, supports 1.6, 1.7, and still in dev 1.8)
  3. Django-Cacheops (newer, supports Python 2.7 or 3.3+, Django 1.8+ and Redis 2.6+ (4.0+ recommended))

Take a look a those if you look for query caching and remember, first profile, find bottlenecks, and if they are causing a problem then optimize.

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming. Donald Knuth.

like image 75
danius Avatar answered Sep 29 '22 16:09

danius


Django doesn't know about other queries! Author.objects.all() and Book.objects.all() are totally different querysets. So if have both in your view and pass them to template context but in your template you do something like:

{% for book in books %}
  {{ book.author.name }}
{% endfor %}

and have N books this will result to N extra database queries (beyond the queries to get all books and authors) !

If instead you had done Book.objects.all().select_related("author") no extra queries will be done in the above template snippet.

Now, select_related() of course adds some overhead to the queries. What happens is that when you do a Book.objects.all() django will return the result of SELECT * FROM BOOKS. If instead you do a Book.objects.all().select_related("author") django will return the result of SELECT * FROM BOOKS B LEFT JOIN AUTHORS A ON B.AUTHOR_ID = A.ID. So for each book it will return both the columns of the book and its corresponding author. However, this overhead is really much smaller when compared to the overhead of hitting the database N times (as explained before).

So, even though select_related creates a small performance overhead (each query returns more fields from the database) it will actually be beneficial to use it except when you are totally sure that you'll need only the columns of the specific model you are querying.

Finally, a great way to really see how many (and which exactly) queries are actuall exectuted in your database is to use django-debug-tooblar (https://github.com/django-debug-toolbar/django-debug-toolbar).

like image 27
Serafeim Avatar answered Sep 29 '22 15:09

Serafeim


Book.objects.select_related("author")

is good enough. No need for Author.objects.all()

{{ book.author.name }}

won't hit the database, because book.author has been prepopulated already.

like image 6
doniyor Avatar answered Sep 29 '22 17:09

doniyor