
Django: Duplicated logic between properties and queryset annotations

When defining my business logic, I struggle to find the right approach, because I often need both a property AND a custom queryset to get the same information. In the end, the logic is duplicated.

Let me explain...

First, after defining my class, I naturally start writing a simple property for data I need:

class PickupTimeSlot(models.Model):

    @property
    def nb_bookings(self) -> int:
        """ How many times this time slot is booked? """ 
        return self.order_set.validated().count()

Then I quickly realise that calling this property on many objects in a queryset leads to one extra query per object and kills performance (even with prefetching, because the filtering runs again). So I solve the problem by writing a custom queryset with an annotation:

class PickupTimeSlotQuerySet(query.QuerySet):

    def add_nb_bookings_data(self):
        return self.annotate(db_nb_bookings=Count('order', filter=Q(order__status=Order.VALIDATED)))
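
To illustrate the difference between the two approaches, here is a minimal sketch outside Django, using the standard-library sqlite3 module. The table and column names (slot, orders, status) are made up for illustration, and the grouped query only approximates what a Count annotation with a filter asks the database to do; the SQL Django actually generates may differ by backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE slot (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        slot_id INTEGER REFERENCES slot(id),
        status TEXT
    );
    INSERT INTO slot (id) VALUES (1), (2), (3);
    INSERT INTO orders (slot_id, status) VALUES
        (1, 'validated'), (1, 'validated'), (1, 'pending'),
        (2, 'validated');
""")

# Property style: one COUNT query per slot, on top of the query that lists the slots.
slot_ids = [row[0] for row in conn.execute("SELECT id FROM slot ORDER BY id")]
per_object = {
    slot_id: conn.execute(
        "SELECT COUNT(*) FROM orders WHERE slot_id = ? AND status = ?",
        (slot_id, "validated"),
    ).fetchone()[0]
    for slot_id in slot_ids
}

# Annotation style: a single grouped query returns every count at once.
annotated = dict(conn.execute("""
    SELECT s.id,
           COUNT(CASE WHEN o.status = 'validated' THEN o.id END)
    FROM slot AS s
    LEFT JOIN orders AS o ON o.slot_id = s.id
    GROUP BY s.id
"""))

print(per_object)
print(annotated)
```

Both approaches give the same counts; the first issues one query per slot, while the second issues a single query regardless of how many slots there are.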

The issue

And then, I end up with 2 problems:

  • I have the same business logic ("how to find the number of bookings") written twice, which could lead to functional errors.
  • I need two different attribute names to avoid conflicts, because setting nb_bookings for both the property and the annotation obviously doesn't work. This forces me, when using my object, to think about how the data is generated in order to call the right attribute name: pickup_slot.nb_bookings (property) or pickup_slot.db_nb_bookings (annotation).

This seems poorly designed to me, and I'm pretty sure there is a better way. I need a way to always write pickup_slot.nb_bookings and get a performant answer, always using the same business logic.

I have an idea, but I'm not sure...

I was thinking of completely removing the property and keeping only the custom queryset. Then, for single objects, I would wrap them in a queryset just to be able to call the annotation method on them. Something like:

pickup_slot = PickupTimeSlot.objects.add_nb_bookings_data().get(pk=pickup_slot.pk)

Seems pretty hacky and unnatural to me. What do you think?

asked Oct 08 '20 08:10 by David D.



4 Answers

I don't think there is a silver bullet here. But I use this pattern in my projects for such cases.

class PickupTimeSlotAnnotatedManager(models.Manager):
    def with_nb_bookings(self):
        return self.annotate(
            _nb_bookings=Count(
                'order', filter=Q(order__status=Order.VALIDATED)
            )
        )

class PickupTimeSlot(models.Model):
    ...
    annotated = PickupTimeSlotAnnotatedManager()

    @property
    def nb_bookings(self) -> int:
        """ How many times this time slot is booked? """ 
        if hasattr(self, '_nb_bookings'):
            return self._nb_bookings
        return self.order_set.validated().count()

In code

qs = PickupTimeSlot.annotated.with_nb_bookings()
for item in qs:
    print(item.nb_bookings)

This way I can always use the property: if the object comes from an annotated queryset, the property returns the annotated value; otherwise it computes the value itself. This approach guarantees that I keep full control over when to make a queryset "heavier" by annotating it with the required values. If I don't need them, I just use the regular PickupTimeSlot.objects. ...

Also, if there are many such properties, you could write a decorator that wraps the property and simplifies the code. It would work like the cached_property decorator, except that it returns the annotated value when one is present.

answered Oct 12 '22 14:10 by Sardorbek Imomaliev


TL;DR

  • Do you need to filter on the "annotated field" results?

    • If yes, keep the manager and use it only where filtering is required. In every other situation, use the property logic.
    • If no, remove the manager/annotation and stick with the property implementation, unless your table is small (~1000 entries) and not expected to grow.
  • The only advantage of annotation that I see here is the ability to filter the data at the database level.


I ran a few tests to reach this conclusion; here they are.

Environment

  • Django 3.0.7
  • Python 3.8
  • PostgreSQL 10.14

Model Structure

For simplicity, I use the following models for the simulation:

class ReporterManager(models.Manager):
    def article_count_qs(self):
        return self.get_queryset().annotate(
            annotate_article_count=models.Count('articles__id', distinct=True))


class Reporter(models.Model):
    objects = models.Manager()
    counter_manager = ReporterManager()
    name = models.CharField(max_length=30)

    @property
    def article_count(self):
        return self.articles.distinct().count()

    def __str__(self):
        return self.name


class Article(models.Model):
    headline = models.CharField(max_length=100)
    reporter = models.ForeignKey(Reporter, on_delete=models.CASCADE,
                                 related_name="articles")

    def __str__(self):
        return self.headline

I populated both the Reporter and Article tables with random strings.

  • Reporter rows ~220K (220514)
  • Article rows ~1M (997311)

Test Cases

  1. Randomly pick a Reporter instance and retrieve its article count. We usually do this in a detail view.
  2. A paginated result. We slice the queryset and iterate over the slice.
  3. Filtering

I use the %timeit magic command of the IPython shell to measure the execution time.
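
Outside IPython, a roughly comparable measurement can be taken with the standard-library timeit module. This is only a sketch: time_callable and sample_workload are illustrative names, and the benchmarked functions from the test cases would take the place of sample_workload.

```python
import statistics
import timeit


def time_callable(func, repeat=7, number=10):
    """Return (mean, stdev) in seconds per call, measured over `repeat` runs."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


def sample_workload():
    # Placeholder for the function being benchmarked.
    return sum(range(1000))


mean, stdev = time_callable(sample_workload)
print(f"{mean * 1e6:.1f} µs ± {stdev * 1e6:.1f} µs per loop")
```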

Test Case 1

For this, I have created these functions, which randomly pick instances from the database

import random

MAX_REPORTER = 220514


def test_manager_random_picking():
    # Queryset indexing is 0-based, so valid positions are 0..MAX_REPORTER - 1
    pos = random.randrange(MAX_REPORTER)
    return Reporter.counter_manager.article_count_qs()[pos].annotate_article_count


def test_property_random_picking():
    pos = random.randrange(MAX_REPORTER)
    return Reporter.objects.all()[pos].article_count

Results

In [2]: %timeit test_manager_random_picking()
8.78 s ± 6.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit test_property_random_picking()
6.36 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Test Case 2

I have created another two functions,

import random

PAGINATE_SIZE = 50


def test_manager_paginate_iteration():
    start = random.randint(1, MAX_REPORTER - PAGINATE_SIZE)
    end = start + PAGINATE_SIZE
    qs = Reporter.counter_manager.article_count_qs()[start:end]
    for reporter in qs:
        reporter.annotate_article_count


def test_property_paginate_iteration():
    start = random.randint(1, MAX_REPORTER - PAGINATE_SIZE)
    end = start + PAGINATE_SIZE
    qs = Reporter.objects.all()[start:end]
    for reporter in qs:
        reporter.article_count

Results

In [8]: %timeit test_manager_paginate_iteration()
4.99 s ± 312 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit test_property_paginate_iteration()
47 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Test Case 3

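Undoubtedly, annotation is the only way to filter at the database level here.

As the results of test cases 1 and 2 show, the annotation approach takes far longer than the property implementation: seconds (8.78 s and 4.99 s) versus milliseconds (6.36 ms and 47 ms).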

answered Oct 12 '22 13:10 by JPG


To avoid any duplication, one option could be:

  • remove the property in the Model
  • use a custom Manager
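  • override its get_queryset() method:

# managers.py (module name taken from the import in the model file)
from django.db import models
from django.db.models import Count, Q

class PickupTimeSlotManager(models.Manager):

    def get_queryset(self):
        # Annotate every queryset produced by this manager with the booking count
        return super().get_queryset().annotate(
            db_nb_bookings=Count(
                'order', filter=Q(order__status=Order.VALIDATED)
            )
        )

from django.db import models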
from .managers import PickupTimeSlotManager

class PickupTimeSlot(models.Model):
    ...
    # Add custom manager
    objects = PickupTimeSlotManager()

Advantage: the calculated property is transparently added to every queryset; no further action is required to use it.

Disadvantage: the computational overhead occurs even when the calculated property is not used.

answered Oct 12 '22 13:10 by Mario Orlandi


Here is an alternative way to achieve what you want:

I usually add prefetch_related every time I write a queryset, so when I face this problem, I solve it in Python.

I let Python loop over the data and count it for me instead of doing it in SQL.

class PickupTimeSlot(models.Model):

    @property
    def nb_bookings(self) -> int:
        """ How many times this time slot is booked? """ 
        orders = self.order_set.all()  # this won't hit the database if you already did the prefetch_related
        # filter() returns a lazy iterator in Python 3, so count with sum() instead of len()
        return sum(1 for order in orders if order.status == Order.VALIDATED)

And the most important part, prefetch_related:

time_slots = PickupTimeSlot.objects.prefetch_related('order_set').all()

You may wonder why I didn't use prefetch_related with a filtered queryset, so that Python doesn't need to filter again, like this:

time_slots = PickupTimeSlot.objects.prefetch_related(
    Prefetch('order_set', queryset=Order.objects.filter(status=Order.VALIDATED))
).all()

The answer is that sometimes we also need other information from the orders. The first approach costs nothing extra if we are going to prefetch the orders anyway.
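
The idea of reusing one unfiltered prefetch for several values can be sketched in plain Python. The Order dataclass below is only a stand-in for the Django model, the list plays the role of an already-prefetched order_set, and status_counts is a hypothetical helper.

```python
from collections import Counter
from dataclasses import dataclass

VALIDATED = "validated"


@dataclass
class Order:
    """Stand-in for the Order model; only the status matters here."""
    status: str


def status_counts(orders):
    """Count orders per status in a single pass over already-loaded rows."""
    return Counter(order.status for order in orders)


# Pretend these rows were loaded once by prefetch_related('order_set').
prefetched = [Order("validated"), Order("pending"), Order("validated")]

counts = status_counts(prefetched)
nb_bookings = counts[VALIDATED]  # the value the property needs
nb_pending = counts["pending"]   # other information from the same rows

print(nb_bookings, nb_pending)
```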

Hope this more or less helps you. Have a nice day!

answered Oct 12 '22 13:10 by Preeti Y.