Why is this Django (1.6) annotate count so slow?

Tags:

python

postgresql

django

Summary: Counting related objects with annotate in a couple of queries is much slower for me than issuing two extra queries per item. Database is PostgreSQL 9.3.5.

I have a model that looks something like this:

class Collection(models.Model):
    have  = models.ManyToManyField(Item, related_name='item_have', through='Have')
    want  = models.ManyToManyField(Item, related_name='item_want', through='Want')
    added = models.DateTimeField()

    class Meta:
        ordering = ['-last_bump']

class Have(models.Model):
    item       = models.ForeignKey(Item)
    collection = models.ForeignKey(Collection, related_name='have_set')
    price      = models.IntegerField(default=0)

class Want(models.Model):
    want       = models.ForeignKey(Item)
    collection = models.ForeignKey(Collection, related_name='want_set')
    price      = models.IntegerField(default=0)

In my view I list these Collections and show how many wants and haves each one contains, using annotate:

from django.db.models import Count
from django.views import generic

class ListView(generic.ListView):
    model = Collection
    queryset = Collection.objects.select_related()
    paginate_by = 20

    def get_queryset(self):
        queryset = super(ListView, self).get_queryset()
        queryset = queryset.annotate(have_count=Count("have", distinct=True),
                                     want_count=Count("want", distinct=True))
        # get_queryset() must return the queryset it builds
        return queryset

This, however, makes my query very slow! I have about 650 records in the DB and django-debug-toolbar says it makes 2 queries and averaging around 400-500ms. I've tried with prefetch_related, but it doesn't make it any quicker.

I did try another thing, in the Collection model, I added this:

@property
def have_count(self):
    return self.have.count()

@property
def want_count(self):
    return self.want.count()

and removed the annotate from my view. With this instead, it makes a total of 42 queries to the database, but it's done in 20-25ms.

What am I doing wrong with my annotation here? Shouldn't it be faster to do the count in one query, vs doing many count queries?

asked Mar 22 '15 14:03

Christoffer Karlsson

1 Answer

Why it is slow: If you annotate over two ManyToMany fields in the same query, you create an unwanted big join of all these tables together. Each collection contributes roughly (its number of haves) × (its number of wants) rows to that join, so the intermediate result grows multiplicatively instead of additively; in the worst case it approaches Have.objects.count() * Want.objects.count(). You then added distinct=True to collapse the duplicated rows, because without it the counts would be hugely inflated.
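The row multiplication can be illustrated without a database. This is only a sketch in plain Python: the item ids are made up, and `itertools.product` stands in for what the double join does to a single collection.

```python
from itertools import product

# Hypothetical data: item ids attached to one collection through each relation.
haves = [1, 2, 3]
wants = [10, 20]

# Joining both relations at once pairs every "have" row with every "want" row.
joined_rows = list(product(haves, wants))

# Counting over the joined rows without distinct counts the duplicates too.
naive_have_count = len([h for h, _ in joined_rows])
# distinct=True recovers the real numbers, but only after the big join was built.
distinct_have_count = len({h for h, _ in joined_rows})
distinct_want_count = len({w for _, w in joined_rows})

print(len(joined_rows), naive_have_count, distinct_have_count, distinct_want_count)
```

Three haves and two wants already yield six joined rows for one collection; the database has to build and de-duplicate that product for every collection.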

Fix for old Django: If you use only queryset.annotate(have_count=Count("have")), you get the right result quickly, with or without distinct=True. You can then run a second query for want_count the same way and combine the results of the two queries in Python, in memory.
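Combining the two results in memory could look roughly like this. `merge_counts` is a hypothetical helper, not part of Django, and the sample rows stand in for what two separate annotated queries might return.

```python
def merge_counts(have_rows, want_rows):
    """Combine two iterables of (collection_pk, count) pairs into one dict."""
    merged = {}
    for pk, count in have_rows:
        merged.setdefault(pk, {"have_count": 0, "want_count": 0})["have_count"] = count
    for pk, count in want_rows:
        merged.setdefault(pk, {"have_count": 0, "want_count": 0})["want_count"] = count
    return merged

# With Django, the inputs could come from two separate queries, for example:
#   have_rows = Collection.objects.annotate(n=Count("have")).values_list("pk", "n")
#   want_rows = Collection.objects.annotate(n=Count("want")).values_list("pk", "n")
# Hypothetical sample rows are used here so the sketch runs on its own.
have_rows = [(1, 3), (2, 0)]
want_rows = [(1, 2), (2, 5)]
counts = merge_counts(have_rows, want_rows)
print(counts)
```

Each query joins only one relation, so neither of them builds the large product, and the merge itself is linear in the number of collections.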

Solution: A nice solution is possible in Django >= 1.11 (two years after your question): use a query with two subqueries, one for Have and one for Want. Everything still runs in one request, but the tables are not all joined together.

from django.db.models import Count, OuterRef, Subquery

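# Correlate with the outer row; order_by() clears the Meta.ordering
sq = Collection.objects.filter(pk=OuterRef('pk')).order_by()
# Group by the collection pk so that each subquery returns exactly one row
have_count_subq = sq.values('pk').annotate(have_count=Count('have')).values('have_count')
want_count_subq = sq.values('pk').annotate(want_count=Count('want')).values('want_count')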
queryset = queryset.annotate(have_count=Subquery(have_count_subq),
                             want_count=Subquery(want_count_subq))

Verify: You can check both the slow and the fixed SQL by printing str(my_queryset.query) and confirming that each query has the shape described above.
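To see the two query shapes side by side outside Django, here is a small sketch using the standard-library sqlite3 module. The table and column names are assumptions chosen for illustration, and the SQL is hand-written to show the general shape, not the exact statements Django emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE collection (id INTEGER PRIMARY KEY);
    CREATE TABLE have (id INTEGER PRIMARY KEY, collection_id INTEGER, item_id INTEGER);
    CREATE TABLE want (id INTEGER PRIMARY KEY, collection_id INTEGER, item_id INTEGER);
    INSERT INTO collection (id) VALUES (1), (2);
    INSERT INTO have (collection_id, item_id) VALUES (1, 1), (1, 2), (1, 3), (2, 1);
    INSERT INTO want (collection_id, item_id) VALUES (1, 10), (1, 20);
""")

# Shape of the slow query: both relations joined at once, DISTINCT undoes the duplication.
join_sql = """
    SELECT c.id, COUNT(DISTINCT h.item_id), COUNT(DISTINCT w.item_id)
    FROM collection c
    LEFT JOIN have h ON h.collection_id = c.id
    LEFT JOIN want w ON w.collection_id = c.id
    GROUP BY c.id ORDER BY c.id
"""

# Shape of the fixed query: one correlated subquery per relation, no row multiplication.
subquery_sql = """
    SELECT c.id,
           (SELECT COUNT(*) FROM have h WHERE h.collection_id = c.id),
           (SELECT COUNT(*) FROM want w WHERE w.collection_id = c.id)
    FROM collection c ORDER BY c.id
"""

join_result = conn.execute(join_sql).fetchall()
subquery_result = conn.execute(subquery_sql).fetchall()
print(join_result)
print(subquery_result)
```

Both statements return the same counts per collection; the difference is only in how many intermediate rows the database has to produce to get there.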

answered Sep 27 '22 20:09

hynekcer
