Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Django ORM: How can I sort by date and then select the best of the objects grouped by a foreign key?

Tags:

I realize my title is kind of complex, but please allow me to demonstrate. I'm on Django 2.2.5 with Python 3. Here are the models I'm currently working with:

from django.db import models
from django.db.models import F
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField, SearchVector, SearchQuery, SearchRank

class Thread(models.Model):
    title = models.CharField(max_length=100)
    last_update = models.DateTimeField(auto_now=True)

class PostQuerySet(models.QuerySet):
    _search_vector = SearchVector('thread__type') + \
                     SearchVector('thread__title') + \
                     SearchVector('from_name') + \
                     SearchVector('from_email') + \
                     SearchVector('message')

    ###
    # There's code here that updates the `Post.search_vector` field for each `Post` object
    # using `PostQuerySet._search_vector`.
    ###

    def search(self, text):
        """
            Search posts using the indexed `search_vector` field. I can, for example, call
            `Post.objects.search('influenza h1n1')`.
        """
        search_query = SearchQuery(text)
        search_rank = SearchRank(F('search_vector'), search_query)
        return self.annotate(rank=search_rank).filter(search_vector=search_query).order_by('-rank')

class Post(models.Model):
    thread = models.ForeignKey(Thread, on_delete=models.CASCADE)
    timestamp = models.DateTimeField()
    from_name = models.CharField(max_length=100)
    from_email = models.EmailField()
    message = models.TextField()
    in_response_to = models.ManyToManyField('self', symmetrical=False, blank=True)
    search_vector = SearchVectorField(null=True)

    objects = PostQuerySet().as_manager()

    class Meta:
        ordering = ['timestamp']
        indexes = [
            GinIndex(fields=['search_vector'])
        ]

(There's some stuff in these models I've cut for brevity and what I believe is irrelevance, but if it becomes important later on, I'll add it in.)

In English, I'm working with an app that represents the data in an email listserv. Basically, there's a Thread that contains multiple Post objects; people reply-all to the initial post and create a discussion. I've just finished implementing a search capability using the built-in support Django offers for full-text search in Django. It's super fast, and I love it. Here's an example of me searching in views.py:

###
# Pull `query` from a form defined in `forms.py`.
###

search_results = Post.objects.search(query).order_by('-timestamp')

This is all fine and dandy and returns search results that definitely make sense. But I've just encountered a situation I'm not quite sure how to handle. The displayed results just aren't quite as clean as I'd like. What this query gets me is all Post objects that match the user-provided query. That's fine, but there may be many Post objects within the same Thread that clog up the results. It might be something like this:

post5 from thread2 - timestamp 2018-04-01, rank 0.5
post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post3 from thread1 - timestamp 2018-01-01, rank 0.6
post2 from thread1 - timestamp 2017-12-01, rank 0.7
post2 from thread2 - timestamp 2017-11-01, rank 0.7

(Here, rank is the relevance returned by Django's SearchRank method.)

What I really want is this: I want to display the most representative matching Post for each Thread, sorted by descending timestamp. In other words, for each Thread containing a Post in the search results, only the highest rank Post should be displayed, and those highest rank Post objects should be sorted by timestamp in descending order. So in the above example, these are the results I'd like to see:

post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post2 from thread1 - timestamp 2017-12-01, rank 0.7

It would be fairly straightforward to do what I want to do with a few for loops, but I'm really hoping there's a way to accomplish this purely in the ORM for efficiency. Do any of you guys have any suggestions? Please let me know if you need me to clarify anything about the problem setup or what I want.

like image 659
Geoff Avatar asked Sep 17 '19 03:09

Geoff


People also ask

How do I sort by date in Django?

This answer Django Datetime Field Query - Order by Time/Hour suggests that I use '__day' with my date_modified as in: Article. objects. filter().

What is __ in Django ORM?

In Django, above argument is called field lookups argument, the field lookups argument's format should be fieldname__lookuptype=value. Please note the double underscore ( __ ) between the field name(depe_desc) and lookup type keyword contains.

What is Select_related in Django?

Using select_related() Django offers a QuerySet method called select_related() that allows you to retrieve related objects for one-to-many relationships. This translates to a single, more complex QuerySet, but you avoid additional queries when accessing the related objects.


2 Answers

I think you have to query the Post model ordering it by thread, rank and timestamp and then use distinct on thread.

Search

This is your search sorted by timestamp:

Post.objects.search("text").order_by("-timestamp")

this is the SQL executed on my local PostgreSQL:

SELECT
    "post"."from_name",
    "thread"."title",
    "post"."timestamp",
    ts_rank("post"."search_vector", plainto_tsquery('text')) AS "rank"
FROM
    "post"
    INNER JOIN "thread" ON ("post"."thread_id" = "thread"."id")
WHERE
    "post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY
    "post"."timestamp" DESC

and these are the search results with my local data:

post1 from thread1 - timestamp 2019-07-01, rank 0.0607927
post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread2 - timestamp 2019-05-01, rank 0.0607927
post3 from thread1 - timestamp 2019-05-01, rank 0.0607927
post1 from thread3 - timestamp 2019-05-01, rank 0.0607927
post3 from thread2 - timestamp 2019-04-01, rank 0.0759909
post4 from thread1 - timestamp 2019-04-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909
post5 from thread1 - timestamp 2019-03-01, rank 0.0607927
post3 from thread3 - timestamp 2019-03-01, rank 0.0607927
post4 from thread2 - timestamp 2019-03-01, rank 0.0607927
post5 from thread2 - timestamp 2019-02-01, rank 0.0759909
post4 from thread3 - timestamp 2019-02-01, rank 0.0759909
post5 from thread3 - timestamp 2019-01-01, rank 0.0759909

Solution

This is the correct query to display only the most representative matching Post (based on the search rank) for each Thread, sorted by descending timestamp

Post.objects.search("text").order_by(
   "thread", "-rank", "-timestamp"
).distinct("thread")

this is the SQL executed on my local PostgreSQL:

SELECT DISTINCT ON ("forum_post"."thread_id")
    "forum_post"."from_name",
    "forum_thread"."title",
    "forum_post"."timestamp",
    ts_rank("forum_post"."search_vector", plainto_tsquery('dolor')) AS "rank"
FROM
    "forum_post"
    INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id")
WHERE
    "forum_post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY
    "forum_post"."thread_id" ASC,
    "rank" DESC,
    "forum_post"."timestamp" DESC

and these are the search results with my local data:

post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909

Note

You can read more about distinct on the official Django documentation.

Update

If you need absolutely order by timestamp in reverse order and if you don't need to display the rank, you can use subquery to order your posts after the previous query:

Post.objects.filter(
    pk__in=Subquery(
        Post.objects.search("dolor")
        .order_by("-thread", "-rank", "-timestamp")
        .distinct("thread")
        .values("id")
    )
).order_by("-timestamp")

this is the SQL executed on my local PostgreSQL:

SELECT
    "forum_post"."from_name",
    "forum_thread"."title",
    "forum_post"."timestamp"
FROM
    "forum_post"
    INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id")
WHERE
    "forum_post"."id" IN ( SELECT DISTINCT ON (U0. "thread_id")
            U0. "id"
        FROM
            "forum_post" U0
        WHERE
            U0. "search_vector" @@ (plainto_tsquery('dolor')) = TRUE
        ORDER BY
            U0. "thread_id" DESC,
            ts_rank(U0. "search_vector", plainto_tsquery('dolor'))
            DESC,
            U0. "timestamp" DESC)
ORDER BY
    "forum_post"."timestamp" DESC

and these are the search results with my local data:

post2 from thread1 - timestamp 2019-06-01
post1 from thread2 - timestamp 2019-06-01
post2 from thread3 - timestamp 2019-04-01
like image 56
Paolo Melchiorre Avatar answered Sep 29 '22 02:09

Paolo Melchiorre


I think we can use distinct to select only first row in group of results.

Can you try something like this:

results = Thread.objects.filter(post__search_vector=search_query) \
    .annotate(rank=search_rank) \
    .order_by('id', '-rank') \
    .distinct('id')

# Then sort these **limited results** by rank manually in python instead of by thread id
# The performance of this should be much better than looping over all results in Python

I am unable to test it as I don't have required Django models setup in place. Please share the output of print(results.query) for above.

like image 34
Pratyush Avatar answered Sep 29 '22 03:09

Pratyush