Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

multiple annotate Sum terms yields inflated answer

In the following setup, I'd like a QuerySet with a list of projects, each annotated with the sum of all its task durations (as tasks_duration) and the sum of all of its tasks' subtask durations (as subtasks_duration). My models (simplified) look like this:

class Project(models.Model):
    pass

class Task(models.Model):
    project = models.ForeignKey(Project)
    duration = models.IntegerField(blank=True, null=True)

class SubTask(models.Model):
    task = models.ForeignKey(Task)
    duration = models.IntegerField(blank=True, null=True)

I make my QuerySet like this:

Projects.objects.annotate(tasks_duration=Sum('task__duration'), subtasks_duration=Sum('task__subtask__duration'))

Related to the behaviour explained in Django annotate() multiple times causes wrong answers I get a tasks_duration that is much higher than it should be. The multiple annotate(Sum()) clauses yield multiple left inner joins in the resultant SQL. With only a single annotate(Sum()) term for tasks_duration, the result is correct. However, I'd like to have both tasks_duration and subtasks_duration.

What would be a suitable way to do this query? I have a working solution that does it per-project, but that's expectedly unusably slow. I also have something similar working with an extra() call, but I'd really like to know if what I want is possible with pure Django.

like image 388
Charl Botha Avatar asked Aug 24 '12 11:08

Charl Botha


2 Answers

The bug is reported here but it's not solved yet even in Django 1.11. The issue is related to joining two tables in reverse relations. Notice that distinct parameter works well for Count but not for Sum. So you can use a trick and write an ORM like below:

 Projects.objects.annotate(
      temp_tasks_duration=Sum('task__duration'),
      temp_subtasks_duration=Sum('task__subtask__duration'),
      tasks_count=Count('task'),
      tasks_count_distinct=Count('task', distinct=True),
      task_subtasks_count=Count('task__subtask'),
      task_subtasks_count_distinct=Count('task__subtask', distinct=True),
 ).annotate(
      tasks_duration=F('temp_tasks_duration')*F('tasks_count_distinct')/F('tasks_count'),
      subtasks_duration=F('temp_subtasks_duration')*F('subtasks_count_distinct')/F('subtasks_count'),
 )

Update: I found that you need to use Subquery. In the following solution, firstly you filter tasks for related to the outerref (OuterRef references to the outer query, so the tasks are filtered for each Project), then you group the tasks by 'project', so that the Sum applies on all the tasks of each projects and returns just one result if any task exists for the project (you have filtered by 'project' and then grouped by that same field; That's why just one group can be there.) or None otherwise. The result would be None if the project has no task, that means we can not use [0] to select the calculated sum.

from django.db.models import Subquery, OuterRef
Projects.objects.annotate(
    tasks_duration=Subquery(
        Task.objects.filter(
            project=OuterRef('pk')
        ).values(
            'project'
        ).annotate(
            the_sum=Sum('task__duration'),
        ).values('the_sum')[:1]
    ),
    subtasks_duration=Sum('task__subtask__duration')
)

Running this code will send just one query to the database, so the performance is great.

like image 55
Happy Ahmad Avatar answered Nov 17 '22 02:11

Happy Ahmad


I get this error as well. Exact same code. It works if I do the aggregation separately, but once I try to get both sums at the same time, one of them gets a factor 2 higher, and the other a factor 3.

I have no idea why Django behaves this way. I have filed a bug report here: https://code.djangoproject.com/ticket/19011 You might be interested in following it as well.

like image 45
tBuLi Avatar answered Nov 17 '22 00:11

tBuLi