Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Django: using an annotated aggregate in queryset update()

I've run into an interesting situation in a new app I've added to an existing project. My goal is to (using a Celery task) update many rows at once with a value that includes annotated aggregated values from foreign keyed objects. Here are some example models that I've used in previous questions:

class Book(models.model):
    author = models.CharField()
    num_pages = models.IntegerField()
    num_chapters = models.IntegerField()

class UserBookRead(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL)
    user_book_stats = models.ForeignKey(UserBookStats)
    book = models.ForeignKey(Book)
    complete = models.BooleanField(default=False)
    pages_read = models.IntegerField()

class UserBookStats(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL)
    total_pages_read = models.IntegerField()

I'm attempting to:

  1. Use the post_save signal from Book instances to update pages_read on related UserBookRead objects when a Book page count is updated.
  2. At the end of the signal, launch a background Celery task to roll up the pages_read from each UserBookRead which was updated, and update the total_pages_read on each related UserBookStats (This is where the problem occurs)

I'm trying to be as lean as possible as far as number of queries- step 1 is complete and only requires a few queries for my actual use case, which seems acceptable for a signal handler, as long as those queries are optimized properly.

Step 2 is more involved, hence the delegation to a background task. I've managed to accomplish most of it in a fairly clean manner (well, for me at least).

The problem I run into is that when annotating the UserBookStats queryset with a total_pages aggregation (the Sum() of all pages_read for related UserBookRead objects), I can't follow that with a straight update of the queryset to set the total_pages_read field.

Here's the code (the Book instance is passed to the task as book):

# use the provided book instance to get the stats which need to be updated
book_read_objects= UserBookRead.objects.filter(book=book)
book_stat_objects = UserBookStats.objects.filter(id__in=book_read_objects.values_list('user_book_stats__id', flat=True).distinct())

# annotate top level stats objects with summed page count
book_stat_objects = book_stat_objects.annotate(total_pages=Sum(F('user_book_read__pages_read')))

# update the objects with that sum
book_stat_objects.update(total_pages_read=F('total_pages'))

On executing the last line, this error is thrown:

django.core.exceptions.FieldError: Aggregate functions are not allowed in this query

After some research, I found an existing Django ticket for this use case here, on which the last comment mentions 2 new features in 1.11 that could make it possible.

Is there any known/accepted way to accomplish this use case, perhaps using Subquery or OuterRef? I haven't had any success trying to fold in the aggregation as a Subquery. The fallback here is:

for obj in book_stat_objects:
    obj.total_pages_read = obj.total_pages
    obj.save()

But with potentially tens of thousands of records in book_stat_objects, I'm really trying to avoid issuing an UPDATE for each one individually.

like image 544
dkhaupt Avatar asked Dec 13 '22 18:12

dkhaupt


1 Answers

I ended up figuring out how to do this with Subquery and OuterRef, but had to take a different approach than I originally expected.

I was able to quickly get a Subquery working, however when I used it to annotate the parent query, I noticed that every annotated value was the first result of the subquery- this was when I realized I needed OuterRef, because the generated SQL wasn't restricting the subquery by anything in the parent query.

This part of the Django docs was super helpful, as was this StackOverflow question. What this process boils down to is that you have to use Subquery to create the aggregation, and OuterRef to ensure the subquery restricts aggregated rows by the parent query PK. At that point, you can annotate with the aggregated value and directly make use of it in a queryset update().

As I mentioned in the question, the code examples are made up. I've tried to adapt them to my actual use case with my changes:

from django.db.models import Subquery, OuterRef
from django.db.models.functions import Coalesce

# create the queryset to use as the subquery, restrict based on the `book_stat_objects` queryset
book_reads = UserBookRead.objects.filter(user_book_stat__in=book_stat_objects, user_book_stats=OuterRef('pk')).values('user_book_stats')
# annotate the future subquery with the aggregation of pages_read from each UserBookRead
total_pages = book_reads.annotate(total=Sum(F('pages_read')))
# annotate each stat object with the subquery total
book_stats = book_stats.annotate(total=Coalesce(Subquery(total_pages), 0))
# update each row with the new total pages count
book_stats.update(total_pages_read=F('total'))

It felt odd to create a queryset that cant be used on it's own (trying to evaluate book_reads will throw an error due to the inclusion of OuterRef), but once you examine the final SQL generated for book_stats, it makes sense.

EDIT

I ended up running into a bug with this code a week or two after figuring out this answer. It turned out to be due to a default ordering for the UserBookRead model. As the Django docs state, default ordering is incorporated into any aggregate GROUP BY clauses, so all of my aggregates were off. The solution to that is to clear the default ordering with a blank order_by() when creating the base subquery:

book_reads = UserBookRead.objects.filter(user_book_stat__in=book_stat_objects, user_book_stats=OuterRef('pk')).values('user_book_stats').order_by()
like image 127
dkhaupt Avatar answered Jan 22 '23 15:01

dkhaupt