While constructing a complexe <code>QuerySet</code> with several annotations, I ran into an issue that I could reproduce with the following simple setup. Here are the models: <pre class="prettyprint lang-py prettyprint-override"><code>class Player(models.Model): name = models.CharField(max_length=200) class Unit(models.Model): player = models.ForeignKey(Player, on_delete=models.CASCADE, related_name='unit_set') rarity = models.IntegerField() class Weapon(models.Model): unit = models.ForeignKey(Unit, on_delete=models.CASCADE, related_name='weapon_set') </code></pre> With my test database, I obtain the following (correct) results: <pre class="prettyprint lang-py prettyprint-override"><code>Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')) [{'id': 1, 'name': 'James', 'weapon_count': 23}, {'id': 2, 'name': 'Max', 'weapon_count': 41}, {'id': 3, 'name': 'Bob', 'weapon_count': 26}] Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')) [{'id': 1, 'name': 'James', 'rarity_sum': 42}, {'id': 2, 'name': 'Max', 'rarity_sum': 89}, {'id': 3, 'name': 'Bob', 'rarity_sum': 67}] </code></pre> If I now combine both annotations in the same <code>QuerySet</code>, I obtain a different (inaccurate) results: <pre class="prettyprint lang-py prettyprint-override"><code>Player.objects.annotate( weapon_count=Count('unit_set__weapon_set', distinct=True), rarity_sum=Sum('unit_set__rarity')) [{'id': 1, 'name': 'James', 'weapon_count': 23, 'rarity_sum': 99}, {'id': 2, 'name': 'Max', 'weapon_count': 41, 'rarity_sum': 183}, {'id': 3, 'name': 'Bob', 'weapon_count': 26, 'rarity_sum': 113}] </code></pre> Notice how <code>rarity_sum</code> have now different values than before. Removing <code>distinct=True</code> does not affect the result. I also tried to use the <code>DistinctSum</code> function from this answer, in which case all <code>rarity_sum</code> are set to <code>18</code> (also inaccurate). Why is this? How can I combine both annotations in the same <code>QuerySet</code>? Edit: here is the sqlite query generated by the combined QuerySet: <pre class="prettyprint lang-sql prettyprint-override"><code>SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count", SUM("sandbox_unit"."rarity") AS "rarity_sum" FROM "sandbox_player" LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id") LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id") GROUP BY "sandbox_player"."id", "sandbox_player"."name" </code></pre> The data used for the results above is available here.

This isn't the problem with Django ORM, this is just the way relational databases work. When you're constructing simple querysets like <pre class="prettyprint"><code>Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')) </code></pre> or <pre class="prettyprint"><code>Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')) </code></pre> ORM does exactly what you expect it to do - join <code>Player</code> with <code>Weapon</code> <pre class="prettyprint"><code>SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT("sandbox_weapon"."id") AS "weapon_count" FROM "sandbox_player" LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id") LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id") GROUP BY "sandbox_player"."id", "sandbox_player"."name" </code></pre> or <code>Player</code> with <code>Unit</code> <pre class="prettyprint"><code>SELECT "sandbox_player"."id", "sandbox_player"."name", SUM("sandbox_unit"."rarity") AS "rarity_sum" FROM "sandbox_player" LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id") GROUP BY "sandbox_player"."id", "sandbox_player"."name" </code></pre> and perform either <code>COUNT</code> or <code>SUM</code> aggregation on them. Note that although the first query has two joins between three tables, the intermediate table <code>Unit</code> is neither in columns referenced in <code>SELECT</code>, nor in the <code>GROUP BY</code> clause. The only role that <code>Unit</code> plays here is to join <code>Player</code> with <code>Weapon</code>. Now if you look at your third queryset, things get more complicated. Again, as in the first query the joins are between three tables, but now <code>Unit</code> is referenced in <code>SELECT</code> as there is <code>SUM</code> aggregation for <code>Unit.rarity</code>: <pre class="prettyprint lang-sql prettyprint-override"><code>SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count", SUM("sandbox_unit"."rarity") AS "rarity_sum" FROM "sandbox_player" LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id") LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id") GROUP BY "sandbox_player"."id", "sandbox_player"."name" </code></pre> And this is the crucial difference between the second and the third queries. In the second query, you're joining <code>Player</code> to <code>Unit</code>, so a single <code>Unit</code> will be listed once for each player that it references. But in the third query you're joining <code>Player</code> to <code>Unit</code> and then <code>Unit</code> to <code>Weapon</code>, so not only a single <code>Unit</code> will be listed once for each player that it references, but also for each weapon that references <code>Unit</code>. Let's take a look at the simple example: <pre class="prettyprint lang-sql prettyprint-override"><code>insert into sandbox_player values (1, "player_1"); insert into sandbox_unit values(1, 10, 1); insert into sandbox_weapon values (1, 1), (2, 1); </code></pre> One player, one unit and two weapons that reference the same unit. Confirm that the problem exists: <pre class="prettyprint"><code>>>> from sandbox.models import Player >>> from django.db.models import Count, Sum >>> Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).values() <QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2}]> >>> Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).values() <QuerySet [{'id': 1, 'name': 'player_1', 'rarity_sum': 10}]> >>> Player.objects.annotate( ... weapon_count=Count('unit_set__weapon_set', distinct=True), ... rarity_sum=Sum('unit_set__rarity')).values() <QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 20}]> </code></pre> From this example it's easy to see that the problem is that in the combined query the unit will be listed twice, one time for each of the weapons referencing it: <pre class="prettyprint"><code>sqlite> SELECT "sandbox_player"."id", ...> "sandbox_player"."name", ...> "sandbox_weapon"."id", ...> "sandbox_unit"."rarity" ...> FROM "sandbox_player" ...> LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id") ...> LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id"); id name id rarity ---------- ---------- ---------- ---------- 1 player_1 1 10 1 player_1 2 10 </code></pre> <h3>What should you do?</h3> As @ivissani mentioned, one of the easiest solutions would be to write subqueries for each of the aggregations: <pre class="prettyprint"><code>>>> from django.db.models import Count, IntegerField, OuterRef, Subquery, Sum >>> weapon_count = Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).filter(pk=OuterRef('pk')) >>> rarity_sum = Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).filter(pk=OuterRef('pk')) >>> qs = Player.objects.annotate( ... weapon_count=Subquery(weapon_count.values('weapon_count'), output_field=IntegerField()), ... rarity_sum=Subquery(rarity_sum.values('rarity_sum'), output_field=IntegerField()) ... ) >>> qs.values() <QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 10}]> </code></pre> which produces the following SQL <pre class="prettyprint lang-sql prettyprint-override"><code>SELECT "sandbox_player"."id", "sandbox_player"."name", ( SELECT COUNT(U2."id") AS "weapon_count" FROM "sandbox_player" U0 LEFT OUTER JOIN "sandbox_unit" U1 ON (U0."id" = U1."player_id") LEFT OUTER JOIN "sandbox_weapon" U2 ON (U1."id" = U2."unit_id") WHERE U0."id" = ("sandbox_player"."id") GROUP BY U0."id", U0."name" ) AS "weapon_count", ( SELECT SUM(U1."rarity") AS "rarity_sum" FROM "sandbox_player" U0 LEFT OUTER JOIN "sandbox_unit" U1 ON (U0."id" = U1."player_id") WHERE U0."id" = ("sandbox_player"."id") GROUP BY U0."id", U0."name") AS "rarity_sum" FROM "sandbox_player" </code></pre>

Django Count and Sum annotations interfere with each other

Tags:

python

django

django-queryset

While constructing a complexe QuerySet with several annotations, I ran into an issue that I could reproduce with the following simple setup.

Here are the models:

class Player(models.Model):
    name = models.CharField(max_length=200)

class Unit(models.Model):
    player = models.ForeignKey(Player, on_delete=models.CASCADE,
                               related_name='unit_set')
    rarity = models.IntegerField()

class Weapon(models.Model):
    unit = models.ForeignKey(Unit, on_delete=models.CASCADE,
                             related_name='weapon_set')

With my test database, I obtain the following (correct) results:

Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))

[{'id': 1, 'name': 'James', 'weapon_count': 23},
 {'id': 2, 'name': 'Max', 'weapon_count': 41},
 {'id': 3, 'name': 'Bob', 'weapon_count': 26}]


Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))

[{'id': 1, 'name': 'James', 'rarity_sum': 42},
 {'id': 2, 'name': 'Max', 'rarity_sum': 89},
 {'id': 3, 'name': 'Bob', 'rarity_sum': 67}]

If I now combine both annotations in the same QuerySet, I obtain a different (inaccurate) results:

Player.objects.annotate(
    weapon_count=Count('unit_set__weapon_set', distinct=True),
    rarity_sum=Sum('unit_set__rarity'))

[{'id': 1, 'name': 'James', 'weapon_count': 23, 'rarity_sum': 99},
 {'id': 2, 'name': 'Max', 'weapon_count': 41, 'rarity_sum': 183},
 {'id': 3, 'name': 'Bob', 'weapon_count': 26, 'rarity_sum': 113}]

Notice how rarity_sum have now different values than before. Removing distinct=True does not affect the result. I also tried to use the DistinctSum function from this answer, in which case all rarity_sum are set to 18 (also inaccurate).

Why is this? How can I combine both annotations in the same QuerySet?

Edit: here is the sqlite query generated by the combined QuerySet:

SELECT "sandbox_player"."id",
       "sandbox_player"."name",
       COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
       SUM("sandbox_unit"."rarity")          AS "rarity_sum"
FROM "sandbox_player"
         LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
         LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

The data used for the results above is available here.

384

asked Jun 12 '19 17:06

abey

4 Answers

This isn't the problem with Django ORM, this is just the way relational databases work. When you're constructing simple querysets like

Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))

Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))

ORM does exactly what you expect it to do - join Player with Weapon

SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT("sandbox_weapon"."id") AS "weapon_count"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" 
    ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" 
    ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

or Player with Unit

SELECT "sandbox_player"."id", "sandbox_player"."name", SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

and perform either COUNT or SUM aggregation on them.

Note that although the first query has two joins between three tables, the intermediate table Unit is neither in columns referenced in SELECT, nor in the GROUP BY clause. The only role that Unit plays here is to join Player with Weapon.

Now if you look at your third queryset, things get more complicated. Again, as in the first query the joins are between three tables, but now Unit is referenced in SELECT as there is SUM aggregation for Unit.rarity:

SELECT "sandbox_player"."id",
       "sandbox_player"."name",
       COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
       SUM("sandbox_unit"."rarity")          AS "rarity_sum"
FROM "sandbox_player"
         LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
         LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

And this is the crucial difference between the second and the third queries. In the second query, you're joining Player to Unit, so a single Unit will be listed once for each player that it references.

But in the third query you're joining Player to Unit and then Unit to Weapon, so not only a single Unit will be listed once for each player that it references, but also for each weapon that references Unit.

Let's take a look at the simple example:

insert into sandbox_player values (1, "player_1");

insert into sandbox_unit values(1, 10, 1);

insert into sandbox_weapon values (1, 1), (2, 1);

One player, one unit and two weapons that reference the same unit.

Confirm that the problem exists:

>>> from sandbox.models import Player
>>> from django.db.models import Count, Sum

>>> Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2}]>

>>> Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'rarity_sum': 10}]>


>>> Player.objects.annotate(
...     weapon_count=Count('unit_set__weapon_set', distinct=True),
...     rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 20}]>

From this example it's easy to see that the problem is that in the combined query the unit will be listed twice, one time for each of the weapons referencing it:

sqlite> SELECT "sandbox_player"."id",
   ...>        "sandbox_player"."name",
   ...>        "sandbox_weapon"."id",
   ...>        "sandbox_unit"."rarity"
   ...> FROM "sandbox_player"
   ...>          LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
   ...>          LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id");
id          name        id          rarity    
----------  ----------  ----------  ----------
1           player_1    1           10        
1           player_1    2           10

What should you do?

As @ivissani mentioned, one of the easiest solutions would be to write subqueries for each of the aggregations:

>>> from django.db.models import Count, IntegerField, OuterRef, Subquery, Sum
>>> weapon_count = Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).filter(pk=OuterRef('pk'))
>>> rarity_sum = Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).filter(pk=OuterRef('pk'))
>>> qs = Player.objects.annotate(
...     weapon_count=Subquery(weapon_count.values('weapon_count'), output_field=IntegerField()),
...     rarity_sum=Subquery(rarity_sum.values('rarity_sum'), output_field=IntegerField())
... )
>>> qs.values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 10}]>

which produces the following SQL

SELECT "sandbox_player"."id", "sandbox_player"."name", 
(
    SELECT COUNT(U2."id") AS "weapon_count"
    FROM "sandbox_player" U0 
    LEFT OUTER JOIN "sandbox_unit" U1
        ON (U0."id" = U1."player_id")
    LEFT OUTER JOIN "sandbox_weapon" U2 
        ON (U1."id" = U2."unit_id")
    WHERE U0."id" = ("sandbox_player"."id") 
    GROUP BY U0."id", U0."name"
) AS "weapon_count", 
(
    SELECT SUM(U1."rarity") AS "rarity_sum"
    FROM "sandbox_player" U0
    LEFT OUTER JOIN "sandbox_unit" U1
        ON (U0."id" = U1."player_id")
    WHERE U0."id" = ("sandbox_player"."id")
GROUP BY U0."id", U0."name") AS "rarity_sum"
FROM "sandbox_player"

answered Oct 19 '22 20:10

rktavi

A few notes to complement rktavi's excellent answer:

1) This issues has apparently been considered a bug for 10 years already. It is even referred to in the official documentation.

2) While converting my actual project's QuerySets to subqueries (as per rktavi's answer), I noticed that combining bare-bone annotations (for the distinct=True counts that always worked correctly) with a Subquery (for the sums) yields extremely long processing (35 sec vs. 100 ms) and incorrect results for the sum. This is true in my actual setup (11 filtered counts on various nested relations and 1 filtered sum on a multiply-nested relation, SQLite3) but cannot be reproduced with the simple models above. This issue can be tricky because another part of your code could add an annotation to your QuerySet (e.g a Table.order_FOO() function), leading to the issue.

3) With the same setup, I have anecdotical evidence that subquery-type QuerySets are faster compared to bare-bone annotation QuerySets (in cases where you have only distinct=True counts, of course). I could observe this both with local SQLite3 (83 ms vs 260 ms) and hosted PostgreSQL (320 ms vs 540 ms).

As a result of the above, I will completely avoid using bare-bone annotations in favour of subqueries.

answered Oct 19 '22 18:10

abey

Based on the excellent answer from @rktavi, I created two helpers classes that simplify the Subquery/Count and Subquery/Sum patterns:

class SubqueryCount(Subquery):
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = PositiveIntegerField()


class SubquerySum(Subquery):
    template = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

    def __init__(self, queryset, column, output_field=None, **extra):
        if output_field is None:
            output_field = queryset.model._meta.get_field(column)
        super().__init__(queryset, output_field, column=column, **extra)

One can use these helpers like so:

from django.db.models import OuterRef

weapons = Weapon.objects.filter(unit__player_id=OuterRef('id'))
units = Unit.objects.filter(player_id=OuterRef('id'))

qs = Player.objects.annotate(weapon_count=SubqueryCount(weapons),
                             rarity_sum=SubquerySum(units, 'rarity'))

answered Oct 19 '22 20:10

Benoit Blanchon

Thanks @rktavi for your amazing answer!!

Here's my use case:

Using Django DRF.

I needed to get Sum and Count from different FK's inside the annotate so that it would all be part of one queryset in order to add these fields to the ordering_fields in DRF.

The Sum and Count were clashing and returning wrong results. Your answer really helped me put it all together.

The annotate was occasionally returning the dates as strings, so I needed to Cast it to DateTimeField.

    donation_filter =  Q(payments__status='donated') & ~Q(payments__payment_type__payment_type='coupon')
    total_donated_SQ = User.objects.annotate(total_donated=Sum('payments__sum', filter=donation_filter )).filter(pk=OuterRef('pk'))
    message_count_SQ = User.objects.annotate(message_count=Count('events__id', filter=Q(events__event_id=6))).filter(pk=OuterRef('pk'))
    queryset = User.objects.annotate(
        total_donated=Subquery(total_donated_SQ.values('total_donated'), output_field=IntegerField()),
        last_donation_date=Cast(Max('payments__updated', filter=donation_filter ), output_field=DateTimeField()),
        message_count=Subquery(message_count_SQ.values('message_count'), output_field=IntegerField()),
        last_message_date=Cast(Max('events__updated', filter=Q(events__event_id=6)), output_field=DateTimeField())
    )

answered Oct 19 '22 19:10

Avramo

Related questions
                            
                                How to do OAuth-requiring operations in a GAE cron job?
                            
                                Getting tag after tag?
                            
                                Matplotlib figure changes for publication after making the figure
                            
                                Mocking pymongo in Flask
                            
                                How to implement non blocking socket server in python
                            
                                Remote debugging on PyCharm
                            
                                log file parsing python
                            
                                How can I get the name of a widget in Tkinter?
                            
                                Sequential or parallel: what is the proper way to read multiple files in python?
                            
                                Cannot install Cython on win7
                            
                                Merging multiple, unaligned data-frames into single pandas data-frame
                            
                                How do I interact with a child process pretending to be a terminal?
                            
                                How to match a string in a sentence
                            
                                How is the Tf-Idf value calculated with analyzer ='char'?
                            
                                Combining nearby bounding boxes along one axis
                            
                                Suppress y axis label in plotly subplot, annotation misalignment
                            
                                Roundtrip IPython transform_cell
                            
                                How can I set the field unique in django?
                            
                                expected two blank lines pep8 warning in python
                            
                                How to extract an element from a array in pyspark

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Django Count and Sum annotations interfere with each other

Tags:

python

django

django-queryset

abey

People also ask

4 Answers

What should you do?

rktavi

abey

Benoit Blanchon

Avramo

Recent Activity

Donate For Us