I realize my title is kind of complex, but please allow me to demonstrate. I'm on Django 2.2.5 with Python 3. Here are the models I'm currently working with:
from django.db import models
from django.db.models import F
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField, SearchVector, SearchQuery, SearchRank
class Thread(models.Model):
title = models.CharField(max_length=100)
last_update = models.DateTimeField(auto_now=True)
class PostQuerySet(models.QuerySet):
_search_vector = SearchVector('thread__type') + \
SearchVector('thread__title') + \
SearchVector('from_name') + \
SearchVector('from_email') + \
SearchVector('message')
###
# There's code here that updates the `Post.search_vector` field for each `Post` object
# using `PostQuerySet._search_vector`.
###
def search(self, text):
"""
Search posts using the indexed `search_vector` field. I can, for example, call
`Post.objects.search('influenza h1n1')`.
"""
search_query = SearchQuery(text)
search_rank = SearchRank(F('search_vector'), search_query)
return self.annotate(rank=search_rank).filter(search_vector=search_query).order_by('-rank')
class Post(models.Model):
thread = models.ForeignKey(Thread, on_delete=models.CASCADE)
timestamp = models.DateTimeField()
from_name = models.CharField(max_length=100)
from_email = models.EmailField()
message = models.TextField()
in_response_to = models.ManyToManyField('self', symmetrical=False, blank=True)
search_vector = SearchVectorField(null=True)
objects = PostQuerySet().as_manager()
class Meta:
ordering = ['timestamp']
indexes = [
GinIndex(fields=['search_vector'])
]
(There's some stuff in these models I've cut for brevity and what I believe is irrelevance, but if it becomes important later on, I'll add it in.)
In English, I'm working with an app that represents the data in an email listserv. Basically, there's a Thread
that contains multiple Post
objects; people reply-all to the initial post and create a discussion. I've just finished implementing a search capability using the built-in support Django offers for full-text search in Django. It's super fast, and I love it. Here's an example of me searching in views.py
:
###
# Pull `query` from a form defined in `forms.py`.
###
search_results = Post.objects.search(query).order_by('-timestamp')
This is all fine and dandy and returns search results that definitely make sense. But I've just encountered a situation I'm not quite sure how to handle. The displayed results just aren't quite as clean as I'd like. What this query gets me is all Post
objects that match the user-provided query
. That's fine, but there may be many Post
objects within the same Thread
that clog up the results. It might be something like this:
post5 from thread2 - timestamp 2018-04-01, rank 0.5
post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post3 from thread1 - timestamp 2018-01-01, rank 0.6
post2 from thread1 - timestamp 2017-12-01, rank 0.7
post2 from thread2 - timestamp 2017-11-01, rank 0.7
(Here, rank
is the relevance returned by Django's SearchRank
method.)
What I really want is this: I want to display the most representative matching Post
for each Thread
, sorted by descending timestamp. In other words, for each Thread
containing a Post
in the search results, only the highest rank
Post
should be displayed, and those highest rank
Post
objects should be sorted by timestamp in descending order. So in the above example, these are the results I'd like to see:
post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post2 from thread1 - timestamp 2017-12-01, rank 0.7
It would be fairly straightforward to do what I want to do with a few for
loops, but I'm really hoping there's a way to accomplish this purely in the ORM for efficiency. Do any of you guys have any suggestions? Please let me know if you need me to clarify anything about the problem setup or what I want.
This answer Django Datetime Field Query - Order by Time/Hour suggests that I use '__day' with my date_modified as in: Article. objects. filter().
In Django, above argument is called field lookups argument, the field lookups argument's format should be fieldname__lookuptype=value. Please note the double underscore ( __ ) between the field name(depe_desc) and lookup type keyword contains.
Using select_related() Django offers a QuerySet method called select_related() that allows you to retrieve related objects for one-to-many relationships. This translates to a single, more complex QuerySet, but you avoid additional queries when accessing the related objects.
I think you have to query the Post model ordering it by thread, rank and timestamp and then use distinct
on thread.
This is your search sorted by timestamp:
Post.objects.search("text").order_by("-timestamp")
this is the SQL executed on my local PostgreSQL:
SELECT
"post"."from_name",
"thread"."title",
"post"."timestamp",
ts_rank("post"."search_vector", plainto_tsquery('text')) AS "rank"
FROM
"post"
INNER JOIN "thread" ON ("post"."thread_id" = "thread"."id")
WHERE
"post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY
"post"."timestamp" DESC
and these are the search results with my local data:
post1 from thread1 - timestamp 2019-07-01, rank 0.0607927
post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread2 - timestamp 2019-05-01, rank 0.0607927
post3 from thread1 - timestamp 2019-05-01, rank 0.0607927
post1 from thread3 - timestamp 2019-05-01, rank 0.0607927
post3 from thread2 - timestamp 2019-04-01, rank 0.0759909
post4 from thread1 - timestamp 2019-04-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909
post5 from thread1 - timestamp 2019-03-01, rank 0.0607927
post3 from thread3 - timestamp 2019-03-01, rank 0.0607927
post4 from thread2 - timestamp 2019-03-01, rank 0.0607927
post5 from thread2 - timestamp 2019-02-01, rank 0.0759909
post4 from thread3 - timestamp 2019-02-01, rank 0.0759909
post5 from thread3 - timestamp 2019-01-01, rank 0.0759909
This is the correct query to display only the most representative matching Post (based on the search rank) for each Thread, sorted by descending timestamp
Post.objects.search("text").order_by( "thread", "-rank", "-timestamp" ).distinct("thread")
this is the SQL executed on my local PostgreSQL:
SELECT DISTINCT ON ("forum_post"."thread_id") "forum_post"."from_name", "forum_thread"."title", "forum_post"."timestamp", ts_rank("forum_post"."search_vector", plainto_tsquery('dolor')) AS "rank" FROM "forum_post" INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id") WHERE "forum_post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE ORDER BY "forum_post"."thread_id" ASC, "rank" DESC, "forum_post"."timestamp" DESC
and these are the search results with my local data:
post2 from thread1 - timestamp 2019-06-01, rank 0.0759909 post1 from thread2 - timestamp 2019-06-01, rank 0.0759909 post2 from thread3 - timestamp 2019-04-01, rank 0.0759909
You can read more about distinct
on the official Django documentation.
If you need absolutely order by timestamp in reverse order and if you don't need to display the rank, you can use subquery to order your posts after the previous query:
Post.objects.filter( pk__in=Subquery( Post.objects.search("dolor") .order_by("-thread", "-rank", "-timestamp") .distinct("thread") .values("id") ) ).order_by("-timestamp")
this is the SQL executed on my local PostgreSQL:
SELECT "forum_post"."from_name", "forum_thread"."title", "forum_post"."timestamp" FROM "forum_post" INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id") WHERE "forum_post"."id" IN ( SELECT DISTINCT ON (U0. "thread_id") U0. "id" FROM "forum_post" U0 WHERE U0. "search_vector" @@ (plainto_tsquery('dolor')) = TRUE ORDER BY U0. "thread_id" DESC, ts_rank(U0. "search_vector", plainto_tsquery('dolor')) DESC, U0. "timestamp" DESC) ORDER BY "forum_post"."timestamp" DESC
and these are the search results with my local data:
post2 from thread1 - timestamp 2019-06-01 post1 from thread2 - timestamp 2019-06-01 post2 from thread3 - timestamp 2019-04-01
I think we can use distinct
to select only first row in group of results.
Can you try something like this:
results = Thread.objects.filter(post__search_vector=search_query) \
.annotate(rank=search_rank) \
.order_by('id', '-rank') \
.distinct('id')
# Then sort these **limited results** by rank manually in python instead of by thread id
# The performance of this should be much better than looping over all results in Python
I am unable to test it as I don't have required Django models setup in place. Please share the output of print(results.query)
for above.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With