Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Django Count and Sum annotations interfere with each other

While constructing a complexe QuerySet with several annotations, I ran into an issue that I could reproduce with the following simple setup.

Here are the models:

class Player(models.Model):
    name = models.CharField(max_length=200)

class Unit(models.Model):
    player = models.ForeignKey(Player, on_delete=models.CASCADE,
                               related_name='unit_set')
    rarity = models.IntegerField()

class Weapon(models.Model):
    unit = models.ForeignKey(Unit, on_delete=models.CASCADE,
                             related_name='weapon_set')

With my test database, I obtain the following (correct) results:

Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))

[{'id': 1, 'name': 'James', 'weapon_count': 23},
 {'id': 2, 'name': 'Max', 'weapon_count': 41},
 {'id': 3, 'name': 'Bob', 'weapon_count': 26}]


Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))

[{'id': 1, 'name': 'James', 'rarity_sum': 42},
 {'id': 2, 'name': 'Max', 'rarity_sum': 89},
 {'id': 3, 'name': 'Bob', 'rarity_sum': 67}]

If I now combine both annotations in the same QuerySet, I obtain a different (inaccurate) results:

Player.objects.annotate(
    weapon_count=Count('unit_set__weapon_set', distinct=True),
    rarity_sum=Sum('unit_set__rarity'))

[{'id': 1, 'name': 'James', 'weapon_count': 23, 'rarity_sum': 99},
 {'id': 2, 'name': 'Max', 'weapon_count': 41, 'rarity_sum': 183},
 {'id': 3, 'name': 'Bob', 'weapon_count': 26, 'rarity_sum': 113}]

Notice how rarity_sum have now different values than before. Removing distinct=True does not affect the result. I also tried to use the DistinctSum function from this answer, in which case all rarity_sum are set to 18 (also inaccurate).

Why is this? How can I combine both annotations in the same QuerySet?

Edit: here is the sqlite query generated by the combined QuerySet:

SELECT "sandbox_player"."id",
       "sandbox_player"."name",
       COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
       SUM("sandbox_unit"."rarity")          AS "rarity_sum"
FROM "sandbox_player"
         LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
         LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

The data used for the results above is available here.

like image 384
abey Avatar asked Jun 12 '19 17:06

abey


People also ask

What is the difference between aggregate and annotate in Django?

Aggregate calculates values for the entire queryset. Annotate calculates summary values for each item in the queryset.

How does annotate work in Django?

In the Django framework, both annotate and aggregate are responsible for identifying a given value set summary. Among these, annotate identifies the summary from each of the items in the queryset. Whereas in the case of aggregate, the summary is calculated for the entire queryset.

How does Django count data?

You can either use Python's len() or use the count() method on any queryset depending on your requirements. Also note, using len() will evaluate the queryset so it's always feasible to use the provided count() method. You should also go through the QuerySet API Documentation for more information.

What is objects all () Django?

The Manager is the main source of QuerySets for a model. For example, Blog.objects.all() returns a QuerySet that contains all Blog objects in the database.


4 Answers

This isn't the problem with Django ORM, this is just the way relational databases work. When you're constructing simple querysets like

Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))

or

Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))

ORM does exactly what you expect it to do - join Player with Weapon

SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT("sandbox_weapon"."id") AS "weapon_count"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" 
    ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" 
    ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

or Player with Unit

SELECT "sandbox_player"."id", "sandbox_player"."name", SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

and perform either COUNT or SUM aggregation on them.

Note that although the first query has two joins between three tables, the intermediate table Unit is neither in columns referenced in SELECT, nor in the GROUP BY clause. The only role that Unit plays here is to join Player with Weapon.

Now if you look at your third queryset, things get more complicated. Again, as in the first query the joins are between three tables, but now Unit is referenced in SELECT as there is SUM aggregation for Unit.rarity:

SELECT "sandbox_player"."id",
       "sandbox_player"."name",
       COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
       SUM("sandbox_unit"."rarity")          AS "rarity_sum"
FROM "sandbox_player"
         LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
         LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

And this is the crucial difference between the second and the third queries. In the second query, you're joining Player to Unit, so a single Unit will be listed once for each player that it references.

But in the third query you're joining Player to Unit and then Unit to Weapon, so not only a single Unit will be listed once for each player that it references, but also for each weapon that references Unit.

Let's take a look at the simple example:

insert into sandbox_player values (1, "player_1");

insert into sandbox_unit values(1, 10, 1);

insert into sandbox_weapon values (1, 1), (2, 1);

One player, one unit and two weapons that reference the same unit.

Confirm that the problem exists:

>>> from sandbox.models import Player
>>> from django.db.models import Count, Sum

>>> Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2}]>

>>> Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'rarity_sum': 10}]>


>>> Player.objects.annotate(
...     weapon_count=Count('unit_set__weapon_set', distinct=True),
...     rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 20}]>

From this example it's easy to see that the problem is that in the combined query the unit will be listed twice, one time for each of the weapons referencing it:

sqlite> SELECT "sandbox_player"."id",
   ...>        "sandbox_player"."name",
   ...>        "sandbox_weapon"."id",
   ...>        "sandbox_unit"."rarity"
   ...> FROM "sandbox_player"
   ...>          LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
   ...>          LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id");
id          name        id          rarity    
----------  ----------  ----------  ----------
1           player_1    1           10        
1           player_1    2           10   

What should you do?

As @ivissani mentioned, one of the easiest solutions would be to write subqueries for each of the aggregations:

>>> from django.db.models import Count, IntegerField, OuterRef, Subquery, Sum
>>> weapon_count = Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).filter(pk=OuterRef('pk'))
>>> rarity_sum = Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).filter(pk=OuterRef('pk'))
>>> qs = Player.objects.annotate(
...     weapon_count=Subquery(weapon_count.values('weapon_count'), output_field=IntegerField()),
...     rarity_sum=Subquery(rarity_sum.values('rarity_sum'), output_field=IntegerField())
... )
>>> qs.values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 10}]>

which produces the following SQL

SELECT "sandbox_player"."id", "sandbox_player"."name", 
(
    SELECT COUNT(U2."id") AS "weapon_count"
    FROM "sandbox_player" U0 
    LEFT OUTER JOIN "sandbox_unit" U1
        ON (U0."id" = U1."player_id")
    LEFT OUTER JOIN "sandbox_weapon" U2 
        ON (U1."id" = U2."unit_id")
    WHERE U0."id" = ("sandbox_player"."id") 
    GROUP BY U0."id", U0."name"
) AS "weapon_count", 
(
    SELECT SUM(U1."rarity") AS "rarity_sum"
    FROM "sandbox_player" U0
    LEFT OUTER JOIN "sandbox_unit" U1
        ON (U0."id" = U1."player_id")
    WHERE U0."id" = ("sandbox_player"."id")
GROUP BY U0."id", U0."name") AS "rarity_sum"
FROM "sandbox_player"
like image 95
rktavi Avatar answered Oct 19 '22 20:10

rktavi


A few notes to complement rktavi's excellent answer:

1) This issues has apparently been considered a bug for 10 years already. It is even referred to in the official documentation.

2) While converting my actual project's QuerySets to subqueries (as per rktavi's answer), I noticed that combining bare-bone annotations (for the distinct=True counts that always worked correctly) with a Subquery (for the sums) yields extremely long processing (35 sec vs. 100 ms) and incorrect results for the sum. This is true in my actual setup (11 filtered counts on various nested relations and 1 filtered sum on a multiply-nested relation, SQLite3) but cannot be reproduced with the simple models above. This issue can be tricky because another part of your code could add an annotation to your QuerySet (e.g a Table.order_FOO() function), leading to the issue.

3) With the same setup, I have anecdotical evidence that subquery-type QuerySets are faster compared to bare-bone annotation QuerySets (in cases where you have only distinct=True counts, of course). I could observe this both with local SQLite3 (83 ms vs 260 ms) and hosted PostgreSQL (320 ms vs 540 ms).

As a result of the above, I will completely avoid using bare-bone annotations in favour of subqueries.

like image 30
abey Avatar answered Oct 19 '22 18:10

abey


Based on the excellent answer from @rktavi, I created two helpers classes that simplify the Subquery/Count and Subquery/Sum patterns:

class SubqueryCount(Subquery):
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = PositiveIntegerField()


class SubquerySum(Subquery):
    template = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

    def __init__(self, queryset, column, output_field=None, **extra):
        if output_field is None:
            output_field = queryset.model._meta.get_field(column)
        super().__init__(queryset, output_field, column=column, **extra)

One can use these helpers like so:

from django.db.models import OuterRef

weapons = Weapon.objects.filter(unit__player_id=OuterRef('id'))
units = Unit.objects.filter(player_id=OuterRef('id'))

qs = Player.objects.annotate(weapon_count=SubqueryCount(weapons),
                             rarity_sum=SubquerySum(units, 'rarity'))
like image 3
Benoit Blanchon Avatar answered Oct 19 '22 20:10

Benoit Blanchon


Thanks @rktavi for your amazing answer!!

Here's my use case:

Using Django DRF.

I needed to get Sum and Count from different FK's inside the annotate so that it would all be part of one queryset in order to add these fields to the ordering_fields in DRF.

The Sum and Count were clashing and returning wrong results. Your answer really helped me put it all together.

The annotate was occasionally returning the dates as strings, so I needed to Cast it to DateTimeField.

    donation_filter =  Q(payments__status='donated') & ~Q(payments__payment_type__payment_type='coupon')
    total_donated_SQ = User.objects.annotate(total_donated=Sum('payments__sum', filter=donation_filter )).filter(pk=OuterRef('pk'))
    message_count_SQ = User.objects.annotate(message_count=Count('events__id', filter=Q(events__event_id=6))).filter(pk=OuterRef('pk'))
    queryset = User.objects.annotate(
        total_donated=Subquery(total_donated_SQ.values('total_donated'), output_field=IntegerField()),
        last_donation_date=Cast(Max('payments__updated', filter=donation_filter ), output_field=DateTimeField()),
        message_count=Subquery(message_count_SQ.values('message_count'), output_field=IntegerField()),
        last_message_date=Cast(Max('events__updated', filter=Q(events__event_id=6)), output_field=DateTimeField())
    )
like image 2
Avramo Avatar answered Oct 19 '22 19:10

Avramo