
Improve Django queryset performance when using annotate Exists

Tags:

django

django-queryset

query-performance

I have a queryset that returns a lot of data. It can be filtered by year, which returns around 100k rows, or show everything, which returns around 1 million rows.

The goal of this annotation is to generate an xlsx spreadsheet.

Model representation: RelatedModel is the many-to-many link between Model and AnotherModel.

Model:
    id
    field1
    field2
    field3

RelatedModel:
    foreign_key_model (Model)
    foreign_key_another (AnotherModel)

Here is the queryset. It annotates each row with whether the relation exists. This annotation is very slow and can take several minutes.

Model.objects.all().annotate(
    related_exists=Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id'))),
    related_column=Case(
        When(related_exists=True, then=Value('The relation exists!')),
        When(related_exists=False, then=Value("The relation doesn't exist!")),
        default=Value('This is the default value!'),
        output_field=CharField(),
    )
).values_list(
    'related_column',
    'field1',
    'field2',
    'field3'
)


asked Nov 27 '19 22:11

Huskell

2 Answers

If the only thing needed is to change how True / False is displayed in the xlsx file, one option is to keep a single related_exists BooleanField annotation and customize how it is converted later, when the xlsx document is created (for example, in a serializer). The database should store raw, unformatted values, and the app should prepare them for display to the user.
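As a rough sketch of that idea, assume the queryset is reduced to `.values_list('related_exists', 'field1', 'field2', 'field3')` and yields plain tuples. The label can then be attached in Python just before the rows are written out. The names `LABELS` and `format_rows` are made up for illustration, and the sample rows stand in for real query results.

```python
# Hypothetical helper: turn the raw boolean from the database into the
# text shown in the spreadsheet. Rows are assumed to be tuples of
# (related_exists, field1, field2, field3).
LABELS = {
    True: "The relation exists!",
    False: "The relation doesn't exist!",
}


def format_rows(rows):
    """Yield rows with the leading boolean replaced by its display label."""
    for related_exists, *fields in rows:
        yield (LABELS[bool(related_exists)], *fields)


# Sample data standing in for the values_list() result.
sample = [(True, "a", "b", "c"), (False, "d", "e", "f")]
formatted = list(format_rows(sample))
```

Because `format_rows` is a generator, it can be fed row by row into whatever writes the spreadsheet without building a second full list in memory.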

Other things to consider:

Indexes to speed up filtering.
If millions of records remain in one table after filtering, table partitioning may be worth considering.

But let's look at the raw SQL of the original query. It will look like this:

SELECT [model_fields],
       EXISTS([CLIENT_SELECT]) AS related_exists,
       CASE
       WHEN EXISTS([CLIENT_SELECT]) = true THEN 'The relation exists!'
       WHEN EXISTS([CLIENT_SELECT]) = false THEN 'The relation does not exist!'
       ELSE 'This is the default value!'
       END AS related_column
FROM model;

Right away we can see that the nested EXISTS query ([CLIENT_SELECT]) appears three times. Even though it is exactly the same query each time, it may be executed at least twice and up to three times. The database may optimize this so that the cost is less than 3x, but it is still worse than running it once.

First, EXISTS returns only True or False, so a single check for True is enough, with 'The relation does not exist!' as the default value.

    related_column=Case(
        When(related_exists=True, then=Value('The relation exists!')),
        default=Value('The relation does not exist!'),
        output_field=CharField(),
    )

Why does related_column run the same select again instead of reusing the value of related_exists?

Because a calculated column cannot be referenced while calculating another column in the same SELECT. This is a database-level constraint that Django knows about, so it duplicates the expression.
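The constraint can be seen directly with a small standard-library sketch. It uses an in-memory SQLite database and a made-up single-column `model` table. Other databases generally behave the same way, though the error message differs and some engines relax the rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model (id INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO model (id) VALUES (1)")

# Referencing the alias `doubled` from another expression in the same
# SELECT list fails: the alias is not visible at that point.
try:
    conn.execute("SELECT id * 2 AS doubled, doubled + 1 AS plus_one FROM model")
    alias_error = None
except sqlite3.OperationalError as exc:
    alias_error = str(exc)

# Repeating the expression works, which is the approach Django takes.
row = conn.execute(
    "SELECT id * 2 AS doubled, id * 2 + 1 AS plus_one FROM model"
).fetchone()
```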

But then we do not actually need the related_exists column. Let's keep only related_column, with a CASE statement and a single EXISTS subquery.
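To check that dropping the extra column and the second WHEN does not change the result, the two SQL shapes can be compared on a toy dataset. This is only a sketch: the table and column names (`model`, `relatedmodel`, `foreign_key_model_id`) are assumptions, not the names Django would generate for your app, and SQLite represents booleans as 1 and 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model (id INTEGER PRIMARY KEY, field1 TEXT);
    CREATE TABLE relatedmodel (
        id INTEGER PRIMARY KEY,
        foreign_key_model_id INTEGER REFERENCES model (id)
    );
    INSERT INTO model (id, field1) VALUES (1, 'a'), (2, 'b');
    INSERT INTO relatedmodel (foreign_key_model_id) VALUES (1), (1);
""")

SUBQUERY = (
    "SELECT 1 FROM relatedmodel r WHERE r.foreign_key_model_id = model.id"
)

# Shape of the original query: the subquery text appears three times.
original_sql = f"""
    SELECT EXISTS({SUBQUERY}) AS related_exists,
           CASE
               WHEN EXISTS({SUBQUERY}) = 1 THEN 'The relation exists!'
               WHEN EXISTS({SUBQUERY}) = 0 THEN 'The relation does not exist!'
               ELSE 'This is the default value!'
           END AS related_column,
           field1
    FROM model ORDER BY id
"""

# Simplified shape: one CASE and one subquery.
simplified_sql = f"""
    SELECT CASE
               WHEN EXISTS({SUBQUERY}) THEN 'The relation exists!'
               ELSE 'The relation does not exist!'
           END AS related_column,
           field1
    FROM model ORDER BY id
"""

# Drop the extra related_exists column before comparing.
original = [row[1:] for row in conn.execute(original_sql)]
simplified = list(conn.execute(simplified_sql))
```

Both queries label the rows identically, so the only thing lost by simplifying is the redundant work.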

This is where Django gets in the way: until 3.0, expressions cannot be used in filters without annotating them first.

So in our case, in order to use Exists in When, we first need to add it as an annotation, and then it is not used as a reference but as a full copy of the expression.

Good news!

Since Django 3.0 we can use expressions that output a BooleanField directly in QuerySet filters, without having to annotate them first. Exists is one such BooleanField expression.

Model.objects.all().annotate(
    related_column=Case(
        When(
            Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id'))),
            then=Value('The relation exists!'),
        ),
        default=Value("The relation doesn't exist!"),
        output_field=CharField(),
    )
)

This produces only one nested select and one annotated field.

Django 2.1, 2.2

A single commit finalized support for boolean expressions, although many of the prerequisites for it were added earlier. One of them is the conditional attribute on expression objects, together with a check for that attribute.

So, although it is neither recommended nor tested, there is a small hack that seems to work for Django 2.1 and 2.2 (earlier versions have no conditional check and would require more intrusive changes):

Create an Exists expression instance.
Monkey patch it with conditional = True.
Use it as the condition in a When statement.

related_model_exists = Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id')))

# Unsupported hack: mark the expression as conditional so When() accepts it.
related_model_exists.conditional = True

Model.objects.all().annotate(
    related_column=Case(
        When(
            related_model_exists,
            then=Value('The relation exists!'),
        ),
        default=Value("The relation doesn't exist!"),
        output_field=CharField(),
    )
)

Related checks

The relatedmodel_set__isnull=True check is not suitable for several reasons:

It performs a LEFT OUTER JOIN, which is less efficient than EXISTS.
Because the LEFT OUTER JOIN joins tables, it is ONLY suitable in a filter() condition (not in annotate with When), and only for OneToOne or OneToMany relations (where the One is on the relatedmodel side).
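The row-multiplication side of the join can be illustrated with a minimal SQLite sketch. Table and column names here are assumptions chosen to mirror the models above. One Model row has two related rows and the other has none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model (id INTEGER PRIMARY KEY);
    CREATE TABLE relatedmodel (
        id INTEGER PRIMARY KEY,
        foreign_key_model_id INTEGER REFERENCES model (id)
    );
    INSERT INTO model (id) VALUES (1), (2);
    -- model 1 has two related rows, model 2 has none
    INSERT INTO relatedmodel (foreign_key_model_id) VALUES (1), (1);
""")

# Join-based check: one output row per (model, relatedmodel) pair.
joined = conn.execute("""
    SELECT m.id, r.id IS NOT NULL
    FROM model m
    LEFT OUTER JOIN relatedmodel r ON r.foreign_key_model_id = m.id
    ORDER BY m.id
""").fetchall()

# EXISTS-based check: always one output row per model.
exists = conn.execute("""
    SELECT m.id,
           EXISTS(SELECT 1 FROM relatedmodel r
                  WHERE r.foreign_key_model_id = m.id)
    FROM model m
    ORDER BY m.id
""").fetchall()
```

With the join, model 1 shows up twice, so an export built on it would contain duplicate lines unless the rows are de-duplicated afterwards. The EXISTS form keeps one line per Model.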


answered Sep 17 '22 11:09

Oleg Russkin

You can considerably simplify your query to:

from django.db.models import Case, CharField, Value, When
Model.objects.all().annotate(
    related_column=Case(
        When(relatedmodel_set__isnull=True, then=Value("The relation doesn't exist!")), 
        default=Value("The relation exists!"), 
        output_field=CharField()
    )
)

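Here relatedmodel_set stands for the related_name on your foreign key. If no related_name is set, the default name to use in lookups is the lowercase model name, relatedmodel.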

answered Sep 17 '22 11:09

solarissmoke
