I've run into an interesting situation in a new app I've added to an existing project. My goal is to (using a Celery task) update many rows at once with a value that includes annotated aggregated values from foreign keyed objects. Here are some example models that I've used in previous questions:
class Book(models.model):
author = models.CharField()
num_pages = models.IntegerField()
num_chapters = models.IntegerField()
class UserBookRead(models.Model):
user = models.ForeignKey(settings.AUTH_USER_MODEL)
user_book_stats = models.ForeignKey(UserBookStats)
book = models.ForeignKey(Book)
complete = models.BooleanField(default=False)
pages_read = models.IntegerField()
class UserBookStats(models.Model):
user = models.ForeignKey(settings.AUTH_USER_MODEL)
total_pages_read = models.IntegerField()
I'm attempting to:
post_save
signal from Book
instances to update pages_read
on related UserBookRead
objects when a Book
page count is updated.pages_read
from each UserBookRead
which was updated, and update the total_pages_read
on each related UserBookStats
(This is where the problem occurs)I'm trying to be as lean as possible as far as number of queries- step 1 is complete and only requires a few queries for my actual use case, which seems acceptable for a signal handler, as long as those queries are optimized properly.
Step 2 is more involved, hence the delegation to a background task. I've managed to accomplish most of it in a fairly clean manner (well, for me at least).
The problem I run into is that when annotating the UserBookStats
queryset with a total_pages
aggregation (the Sum()
of all pages_read
for related UserBookRead
objects), I can't follow that with a straight update
of the queryset to set the total_pages_read
field.
Here's the code (the Book
instance is passed to the task as book
):
# use the provided book instance to get the stats which need to be updated
book_read_objects= UserBookRead.objects.filter(book=book)
book_stat_objects = UserBookStats.objects.filter(id__in=book_read_objects.values_list('user_book_stats__id', flat=True).distinct())
# annotate top level stats objects with summed page count
book_stat_objects = book_stat_objects.annotate(total_pages=Sum(F('user_book_read__pages_read')))
# update the objects with that sum
book_stat_objects.update(total_pages_read=F('total_pages'))
On executing the last line, this error is thrown:
django.core.exceptions.FieldError: Aggregate functions are not allowed in this query
After some research, I found an existing Django ticket for this use case here, on which the last comment mentions 2 new features in 1.11 that could make it possible.
Is there any known/accepted way to accomplish this use case, perhaps using Subquery
or OuterRef
? I haven't had any success trying to fold in the aggregation as a Subquery
. The fallback here is:
for obj in book_stat_objects:
obj.total_pages_read = obj.total_pages
obj.save()
But with potentially tens of thousands of records in book_stat_objects
, I'm really trying to avoid issuing an UPDATE for each one individually.
I ended up figuring out how to do this with Subquery
and OuterRef
, but had to take a different approach than I originally expected.
I was able to quickly get a Subquery
working, however when I used it to annotate the parent query, I noticed that every annotated value was the first result of the subquery- this was when I realized I needed OuterRef
, because the generated SQL wasn't restricting the subquery by anything in the parent query.
This part of the Django docs was super helpful, as was this StackOverflow question. What this process boils down to is that you have to use Subquery
to create the aggregation, and OuterRef
to ensure the subquery restricts aggregated rows by the parent query PK. At that point, you can annotate with the aggregated value and directly make use of it in a queryset update()
.
As I mentioned in the question, the code examples are made up. I've tried to adapt them to my actual use case with my changes:
from django.db.models import Subquery, OuterRef
from django.db.models.functions import Coalesce
# create the queryset to use as the subquery, restrict based on the `book_stat_objects` queryset
book_reads = UserBookRead.objects.filter(user_book_stat__in=book_stat_objects, user_book_stats=OuterRef('pk')).values('user_book_stats')
# annotate the future subquery with the aggregation of pages_read from each UserBookRead
total_pages = book_reads.annotate(total=Sum(F('pages_read')))
# annotate each stat object with the subquery total
book_stats = book_stats.annotate(total=Coalesce(Subquery(total_pages), 0))
# update each row with the new total pages count
book_stats.update(total_pages_read=F('total'))
It felt odd to create a queryset that cant be used on it's own (trying to evaluate book_reads
will throw an error due to the inclusion of OuterRef
), but once you examine the final SQL generated for book_stats
, it makes sense.
EDIT
I ended up running into a bug with this code a week or two after figuring out this answer. It turned out to be due to a default ordering
for the UserBookRead
model. As the Django docs state, default ordering
is incorporated into any aggregate GROUP BY
clauses, so all of my aggregates were off. The solution to that is to clear the default ordering with a blank order_by()
when creating the base subquery:
book_reads = UserBookRead.objects.filter(user_book_stat__in=book_stat_objects, user_book_stats=OuterRef('pk')).values('user_book_stats').order_by()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With