Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Django ORM query optimisation with multiple joins

In my app, I can describe an Entity using different Protocols, with each Protocol being a collection of various Traits, and each Trait allows two or more Classes. So, a Description is a collection of Expressions. E.g., I want to describe an entity "John" with the Protocol "X" that comprises the following two Traits and Classes:

Protocol ABC

Trait 1: Height

Available Classes: a. Short b. Medium c. Tall

Trait 2: Weight

Available Classes: a. Light b. Medium c. Heavy

John's Description: Expression 1: c. Tall, Expression 2: b. Medium

My model specification (barebone essentials for simplicity):

class Protocol(models.Model):
    """
    A Protocol is a collection of Traits
    """
    name = models.CharField()

class Trait(models.Model):
    """
    Stores the Traits. Each Trait can have multiple Classes
    """

    name = models.CharField()
    protocol = models.ForeignKey(
        Protocol,
        help_text="The reference protocol of the trait",
    )

class Class(models.Model):
    """
    Stores the different Classes related to a Trait.
    """

    name = models.CharField()
    trait = models.ForeignKey(Trait)

class Description(models.Model):
    """
    Stores the Descriptions. A description is a collection of Expressions.
    """

    name = models.CharField()
    protocol = models.ForeignKey(
        Protocol,
        help_text="reference to the protocol used to make the description;\
            this will define which Traits will be available",
    )
    entity = models.ForeignKey(
        Entity,
        help_text="the Entity to which the description refers to",
    )

class Expression(models.Model):
    """
    Stores the expressions of entities related to a specific
    Description. It refers to one particular Class (which is
    then associated with a specific Trait)
    """

    class = models.ForeignKey(Class)
    description = models.ForeignKey(Description)

Following the previous example, let's say I want to find all the Entities that are medium or tall (Trait 1) and heavy (Trait 2). The query I'm now using is the following:

# This is the filter returned by the HTML form, which list
# all the available Classes for each Trait of the selected Protocol
filters = [
  {'trait': 1, 'class': [2, 3]},
  {'trait': 2, 'class': [6,]},
]

queryset = Description.objects.all()

for filter in filters:
  queryset = queryset.filter(expression_set__class__in=filter["class"])

The problem is that the query is slow (I have ATM ~1000 Descriptions, described with a Protocol of 40 Traits, each Trait having 2 to 5 Classes). It takes about two seconds to return the results even when filtering by only 5-6 Expressions. I tried using prefetch_related("expression_set") or prefetch_related("expression_set__class") but with no significant improvement.

The question is: can you suggest a way to improve the performance, or this is simply the reality of searching through so many tables?

Thank you very much for your time.


EDIT: The following is the query generated by the Manager when, e.g., eight filters (see previous code snippet) are applied.

SELECT "describe_description"."id",
       "describe_description"."name",
       "describe_description"."protocol_id",
  FROM "describe_description"
 INNER JOIN "describe_expression"
    ON ("describe_description"."id" = "describe_expression"."description_id")
 INNER JOIN "describe_expression" T4
    ON ("describe_description"."id" = T4."description_id")
 INNER JOIN "describe_expression" T6
    ON ("describe_description"."id" = T6."description_id")
 INNER JOIN "describe_expression" T8
    ON ("describe_description"."id" = T8."description_id")
 INNER JOIN "describe_expression" T10
    ON ("describe_description"."id" = T10."description_id")
 INNER JOIN "describe_expression" T12
    ON ("describe_description"."id" = T12."description_id")
 INNER JOIN "describe_expression" T14
    ON ("describe_description"."id" = T14."description_id")
 INNER JOIN "describe_expression" T16
    ON ("describe_description"."id" = T16."description_id")
 INNER JOIN "describe_expression" T18
    ON ("describe_description"."id" = T18."description_id")
 WHERE ("describe_expression"."class_id" IN (732) AND T4."class_id" IN (740) AND T6."class_id" IN (760) AND T8."class_id" IN (783) AND T10."class_id" IN (794) AND T12."class_id" IN (851) AND T14."class_id" IN (857) AND T16."class_id" IN (860) AND T18."class_id" IN (874))
like image 523
kofm Avatar asked Nov 15 '25 19:11

kofm


2 Answers

First you should avoid multiple joins by aggregating desired filters upfront:

filters = [
  {'trait': 1, 'class': [2, 3]},
  {'trait': 2, 'class': [6,]},
]

queryset = Description.objects.all()
class_filter = []
for filter_entry in filters:
    class_filter.append(filter_entry["class"])
queryset = queryset.filter(expression_set__class__in=class_filter)

Second problem is scanning for text values. Use db_index=True on your Class.name field.

EDIT: There is a difference in chaining the filters on the same table and using Q objects. It does not act like AND on the same object. It seems counterintuitive as in the sql you can see AND but that is AND on multiple joins, where each join effectively duplicates the descriptions (that is why it get's slow). Best explained in the Django docs or this article.

Quick excerpt of the docs:

To select all blogs containing at least one entry from 2008 having “Lennon” in its headline (the same entry satisfying both conditions), we would write:

Blog.objects.filter(entry__headline__contains='Lennon', entry__pub_date__year=2008)

Otherwise, to perform a more permissive query selecting any blogs with merely some entry with “Lennon” in its headline and some entry from 2008, we would write:

Blog.objects.filter(entry__headline__contains='Lennon').filter(entry__pub_date__year=2008)

EDIT 2: Schematic example of above from this answer:

enter image description here

Blog.objects.filter(entry__headline_contains='Lennon', 
entry__pub_date__year=2008)

filters only Blog 1

Blog.objects.filter(entry__headline_contains='Lennon').filter(
entry__pub_date__year=2008)  

filters blog 1 and 2

like image 108
okolimar Avatar answered Nov 17 '25 10:11

okolimar


It's slightly better to use multiple functions instead I think. It runs at the same speed as using classes, if not even faster. Check this question out. After you start using functions, you can try using @cached_property(func, name=None):

It’s common to have to call a class instance’s method more than once. If that function is expensive, then doing so can be wasteful.

Using the cached_property decorator saves the value returned by a property; the next time the function is called on that instance, it will return the saved value rather than re-computing it. Note that this only works on methods that take self as their only argument and that it changes the method to a property.

Consider a typical case, where a view might need to call a model’s method to perform some computation, before placing the model instance into the context, where the template might invoke the method once more:

# the model
class Person(models.Model):

    def friends(self):
        # expensive computation
        ...
        return friends

# in the view:
if person.friends():
    ...

And in the template you would have:

{% for friend in person.friends %}

Here, friends() will be called twice. Since the instance person in the view and the template are the same, decorating the friends() method with @cached_property can avoid that:

from django.utils.functional import cached_property

class Person(models.Model):

    @cached_property
    def friends(self):
        ...

Relevant questions and sources:

  • Caching class attributes in Python
  • How much slower python classes are compared to their equivalent functions?
  • djangoproject docs - django.utils.functional.cached_property
like image 35
DialFrost Avatar answered Nov 17 '25 10:11

DialFrost



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!