Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improve Django queryset performance when using annotate Exists

I have a queryset that returns a lot of data, it can be filtered by year which will return around 100k lines, or show all which will bring around 1 million lines.

The objective of this annotate is to generate a xlsx spreadsheet.

Models representation, RelatedModel is manytomany between Model and AnotherModel

Model:
    id
    field1
    field2
    field3

RelatedModel:
    foreign_key_model (Model)
    foreign_key_another (AnotherModel)

Queryset, if the relation exists it will annotate, this annotate is very slow and can take several minutes.

Model.objects.all().annotate(
    related_exists=Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id'))),
    related_column=Case(
        When(related_exists=True, then=Value('The relation exists!')),
        When(related_exists=False, then=Value('The relation doesn't exist!')),
        default=Value('This is the default value!'),
        output_field=CharField(),
    )
).values_list(
    'related_column',
    'field1',
    'field2',
    'field3'
)
like image 874
Huskell Avatar asked Nov 27 '19 22:11

Huskell


People also ask

Is Django QuerySet lazy?

This is because a Django QuerySet is a lazy object. It contains all of the information it needs to populate itself from the database, but will not actually do so until the information is needed.

What is difference between annotate and aggregate Django?

In the Django framework, both annotate and aggregate are responsible for identifying a given value set summary. Among these, annotate identifies the summary from each of the items in the queryset. Whereas in the case of aggregate, the summary is calculated for the entire queryset.

What is F in Django QuerySet?

In the Django QuerySet API, F() expressions are used to refer to model field values directly in the database.

Why Django ORM is slow?

Django's ORM is fantastic. It's slow because it chooses to be convenient but if it needs to be fast it's just a few slight API calls away. If you're curious, check out the code on Github.


2 Answers

If only thing needed is to change how True / False is displayed in xlsx - one option is to just have one related_exists BooleanField annotation and later customize how it will be converted when creating xlsx document - i.e. in serializer. Database should store raw / unformatted values, and app prepare them to be shown to user.

Other things to consider:

  • Indexes to speed-up filtering.
  • If you have millions of records after filtering, in one table - maybe table partitioning could be considered.

But let's look into raw sql of original query. It will be like this:

SELECT [model_fields],
       EXISTS([CLIENT_SELECT]) AS related_exists,
       CASE
       WHEN EXISTS([CLIENT_SELECT]) = true THEN 'The relation exists!'
       WHEN EXISTS([CLIENT_SELECT]) = true THEN 'The relation does not exist!'
       ELSE 'The relation exists!'
       END AS related_column
FROM model;

And right away we can see nested query for Exists CLIENT_SELECT is there 3 times. Even though it is exactly the same, it may be executed minimum 2 times and up to 3 times. Database may optimize it to be faster than 3x, but it still is not optimal as 1x.

First, EXISTS returns either True or False, we can leave just one check that it is True, making 'The relation does not exist!' the default value.

    related_column=Case(
        When(related_exists=True, then=Value('The relation exists!')),
        default=Value('The relation does not exist!')

Why related_column performs same select again and not takes the value of related_exists?

Because we cannot reference calculated columns while calculating another columns - and this is database level constraint django knows about and duplicates expression.

Wait, then we actually do not need related_exists column, lets just leave related_column with CASE statement and 1 exists subquery.

Here comes Django - we cannot (till 3.0) use expressions in filters without annotating them first.

So, it our case it is like: in order to use Exist in When, we first need to add it as annotation, but it won't be used as a reference, but a full copy of expression.


Good news!

Since Django 3.0 we can use expressions that output BooleanField directly in QuerySet filters, without having to first annotate. Exists is one of such BooleaField expressions.

Model.objects.all().annotate(
    related_column=Case(
        When(
            Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id'))),
            then=Value('The relation exists!'),
        ),
        default=Value('The relation doesn't exist!'),
        output_field=CharField(),
    )
)

And only one nested select, and one annotated field.


Django 2.1, 2.2

Here's the commit that finalized allowance of boolean expressions although many pre-conditions for it were added earlier. One of them is presence of conditional attribute on expression object and check for this attribute.

So, although not recommended and not tested it seems quite working little hack for Django 2.1, 2.2 (before there was no conditional check, and it will require more intrusive changes):

  • create Exists expression instance
  • monkey patch it with conditional = True
  • use it as condition in When statement
related_model_exists = Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id')))

setattr(related_model_exists, 'conditional', True)

Model.objects.all().annotate(
    related_column=Case(
        When(
            relate_model_exists,
            then=Value('The relation exists!'),
        ),
        default=Value('The relation doesn't exist!'),
        output_field=CharField(),
    )
)


Related checks

relatedmodel_set__isnull=True check is not suitable for several reasons:

  • it performs LEFT OUTER JOIN - that is less efficient than EXISTS
  • it performs LEFT OUTER JOIN - it joins tables, this makes it ONLY suitable in filter() condition (not in annotate - When), and only for OneToOne or OneToMany (One is on relatedmodel side) relations
like image 189
Oleg Russkin Avatar answered Sep 17 '22 11:09

Oleg Russkin


You can considerably simplify your query to:

from django.db.models import Count
Model.objects.all().annotate(
    related_column=Case(
        When(relatedmodel_set__isnull=True, then=Value("The relation doesn't exist!")), 
        default=Value("The relation exists!"), 
        output_field=CharField()
    )
)

Where relatedmodel_set is the related_name on your foreign key.

like image 31
solarissmoke Avatar answered Sep 17 '22 11:09

solarissmoke