
Why is this Django (1.6) annotate count so slow?

Summary: Counting related objects with annotate in a couple of queries is much slower for me than issuing two extra count queries per item. The database is PostgreSQL 9.3.5.


I have a model that looks something like this:

class Collection(models.Model):
    have  = models.ManyToManyField(Item, related_name='item_have', through='Have')
    want  = models.ManyToManyField(Item, related_name='item_want', through='Want')
    added = models.DateTimeField()

    class Meta:
        ordering = ['-last_bump']

class Have(models.Model):
    item       = models.ForeignKey(Item)
    collection = models.ForeignKey(Collection, related_name='have_set')
    price      = models.IntegerField(default=0)

class Want(models.Model):
    want       = models.ForeignKey(Item)
    collection = models.ForeignKey(Collection, related_name='want_set')
    price      = models.IntegerField(default=0)

In my view I list these Collections and want to show how many wants and haves each one has, so I annotate the queryset:

from django.db.models import Count
from django.views import generic

class ListView(generic.ListView):
    model = Collection
    queryset = Collection.objects.select_related()
    paginate_by = 20

    def get_queryset(self):
        queryset = super(ListView, self).get_queryset()
        queryset = queryset.annotate(have_count=Count("have", distinct=True),
                                     want_count=Count("want", distinct=True))
        return queryset

This, however, makes my query very slow! I have about 650 records in the DB, and django-debug-toolbar says the page makes 2 queries, averaging around 400-500ms. I've tried prefetch_related, but it doesn't make it any quicker.

I also tried another thing: in the Collection model, I added this:

@property
def have_count(self):
    return self.have.count()

@property
def want_count(self):
    return self.want.count()

and removed the annotate from my view. With this change, it makes a total of 42 queries to the database, but they finish in 20-25ms.

What am I doing wrong with my annotation here? Shouldn't it be faster to do the count in one query, vs doing many count queries?

Christoffer Karlsson asked Mar 22 '15 14:03



1 Answer

Why it is slow: If you annotate over two ManyToMany fields in one query, you create an unwanted big join of all these tables together. For each collection the join produces one row per (have, want) pair, so the number of rows that must be evaluated grows with the product of the two counts, up to Have.objects.count() * Want.objects.count() in the worst case. You then added distinct=True to collapse the duplicated rows again, so that the counts do not come out as invalid, hugely inflated numbers.
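To make the row multiplication concrete, here is a minimal sketch using the standard-library sqlite3 module instead of Django and PostgreSQL. The table and column names are simplified assumptions for illustration, not the schema Django generates. One collection with 3 haves and 4 wants turns into 12 joined rows, and the plain counts are only correct once DISTINCT is applied.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database: one collection, 3 haves, 4 wants."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE collection (id INTEGER PRIMARY KEY);
        CREATE TABLE have (id INTEGER PRIMARY KEY, collection_id INTEGER, item_id INTEGER);
        CREATE TABLE want (id INTEGER PRIMARY KEY, collection_id INTEGER, item_id INTEGER);
        INSERT INTO collection (id) VALUES (1);
        INSERT INTO have (collection_id, item_id) VALUES (1, 10), (1, 11), (1, 12);
        INSERT INTO want (collection_id, item_id) VALUES (1, 20), (1, 21), (1, 22), (1, 23);
    """)
    return conn


# Both relations are joined in a single query, as the double annotate does.
DOUBLE_JOIN = """
    FROM collection c
    LEFT JOIN have h ON h.collection_id = c.id
    LEFT JOIN want w ON w.collection_id = c.id
"""


def joined_row_count(conn):
    """Number of rows the database has to produce before aggregating."""
    return conn.execute("SELECT COUNT(*)" + DOUBLE_JOIN).fetchone()[0]


def counts_from_double_join(conn):
    """Plain counts (inflated) next to DISTINCT counts (correct)."""
    sql = (
        "SELECT c.id, COUNT(h.item_id), COUNT(w.item_id), "
        "COUNT(DISTINCT h.item_id), COUNT(DISTINCT w.item_id)"
        + DOUBLE_JOIN
        + "GROUP BY c.id"
    )
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(joined_row_count(conn))         # 3 haves * 4 wants = 12 rows
    print(counts_from_double_join(conn))  # [(1, 12, 12, 3, 4)]
```

With realistic data, the database must build and then de-duplicate all of those intermediate rows, which is where the time goes.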

Fix for old Django: If you use only queryset.annotate(have_count=Count("have")), you get the right result quickly, with or without distinct=True. You can run a second query of the same kind for want_count and then combine the results of the two queries in Python, in memory.
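The in-memory combination could look roughly like the sketch below. It assumes each query has been reduced to (pk, count) pairs, for example with something like values_list('pk', 'have_count') on each annotated queryset. The helper name merge_counts and the dict layout are hypothetical, chosen only for illustration.

```python
def merge_counts(have_rows, want_rows):
    """Combine two iterables of (collection_pk, count) pairs into one dict.

    Hypothetical helper: a pk missing from one of the inputs keeps a count of 0.
    """
    merged = {}
    for pk, count in have_rows:
        merged.setdefault(pk, {"have_count": 0, "want_count": 0})["have_count"] = count
    for pk, count in want_rows:
        merged.setdefault(pk, {"have_count": 0, "want_count": 0})["want_count"] = count
    return merged


if __name__ == "__main__":
    # Made-up sample rows standing in for the results of the two queries.
    have_rows = [(1, 3), (2, 0)]
    want_rows = [(1, 4), (2, 5)]
    print(merge_counts(have_rows, want_rows))
```

In a paginated list view you would only need to do this for the pks on the current page, which keeps both queries small.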


Solution: A nice solution is possible in Django >= 1.11 (two years after your question) by using a query with two subqueries, one for Have and one for Want, all in one request, but without joining all the tables together.

from django.db.models import Count, OuterRef, Subquery

# Correlated base query; order_by() clears the default Meta ordering.
sq = Collection.objects.filter(pk=OuterRef('pk')).order_by()
# Group by pk so that each subquery returns exactly one row per collection.
have_count_subq = sq.values('pk').annotate(have_count=Count('have')).values('have_count')
want_count_subq = sq.values('pk').annotate(want_count=Count('want')).values('want_count')
queryset = queryset.annotate(have_count=Subquery(have_count_subq),
                             want_count=Subquery(want_count_subq))

Verify: You can check both the slow and the fixed SQL query by printing str(my_queryset.query) and confirming that each one has the shape described above.
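As a rough guide to what to look for, the fixed query should contain two independent correlated subqueries rather than one join across all tables. The sketch below shows that shape with the standard-library sqlite3 module. The SQL is hand-written on a simplified, assumed schema, so it is not the exact text Django prints.

```python
import sqlite3

# Each count is computed by its own correlated subquery, so the have and want
# rows are never multiplied against each other.
SUBQUERY_SQL = """
    SELECT c.id,
           (SELECT COUNT(*) FROM have h WHERE h.collection_id = c.id) AS have_count,
           (SELECT COUNT(*) FROM want w WHERE w.collection_id = c.id) AS want_count
    FROM collection c
    ORDER BY c.id
"""


def build_demo_db():
    """In-memory demo data: collection 1 has 2 haves and 3 wants,
    collection 2 has no haves and 1 want."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE collection (id INTEGER PRIMARY KEY);
        CREATE TABLE have (id INTEGER PRIMARY KEY, collection_id INTEGER, item_id INTEGER);
        CREATE TABLE want (id INTEGER PRIMARY KEY, collection_id INTEGER, item_id INTEGER);
        INSERT INTO collection (id) VALUES (1), (2);
        INSERT INTO have (collection_id, item_id) VALUES (1, 10), (1, 11);
        INSERT INTO want (collection_id, item_id) VALUES (1, 20), (1, 21), (1, 22), (2, 20);
    """)
    return conn


def count_with_subqueries(conn):
    """Return (collection_id, have_count, want_count) rows."""
    return conn.execute(SUBQUERY_SQL).fetchall()


if __name__ == "__main__":
    print(count_with_subqueries(build_demo_db()))  # [(1, 2, 3), (2, 0, 1)]
```

No DISTINCT is needed here, because neither count ever sees rows from the other relation.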

hynekcer answered Sep 27 '22 20:09