I have to implement a search function which will be fault tolerant.
Currently, I have the following situation:
Models:
class Tag(models.Model):
    name = models.CharField(max_length=255)

class Illustration(models.Model):
    name = models.CharField(max_length=255)
    tags = models.ManyToManyField(Tag)
Query:
queryset.annotate(similarity=TrigramSimilarity('name', fulltext) + TrigramSimilarity('tags__name', fulltext))
Example data:
Illustrations:
ID | Name | Tags |
---|--------|-------------------|
1 | "Dog" | "Animal", "Brown" |
2 | "Cat" | "Animals" |
Illustration has Tags:
ID_Illustration | ID_Tag |
----------------|--------|
1 | 1 |
1 | 2 |
2 | 3 |
Tags:
ID_Tag | Name |
-------|----------|
1 | Animal |
2 | Brown |
3 | Animals |
When I run the query with "Animal", the similarity for "Dog" should be higher than for "Cat", as it is a perfect match.
Unfortunately, both tags are considered together somehow.
Currently, it looks like it's concatenating the tags in a single string and then checks for similarity:
TrigramSimilarity("Animal Brown", "Animal") => X
But I would like to adjust it so that I get the highest similarity between an Illustration instance's name and its tags:
Max([
    TrigramSimilarity('Name', "Animal"),
    TrigramSimilarity("Tag_1", "Animal"),
    TrigramSimilarity("Tag_2", "Animal"),
]) => X
Edit1: I'm trying to query all Illustrations where either the title or one of the tags has a similarity greater than X.
Edit2: Additional example:
fulltext = 'Animal'
TrigramSimilarity('Animal Brown', fulltext) => x
TrigramSimilarity('Animals', fulltext) => y
Where x < y
But what I want is actually
TrigramSimilarity(Max(['Animal', 'Brown']), fulltext) => x (Similarity to Animal)
TrigramSimilarity('Animals', fulltext) => y
Where x > y
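To make the numbers in Edit2 concrete, here is a minimal pure-Python sketch of trigram similarity. It only approximates what PostgreSQL's pg_trgm does (lowercase the text, split it into alphanumeric words, pad each word with two leading spaces and one trailing space, then divide the shared trigrams by all distinct trigrams). The helper names trigrams and trigram_similarity are made up for illustration and are not part of Django or PostgreSQL.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, roughly the way pg_trgm builds them."""
    grams = set()
    # Assumption: only ASCII letters and digits count as word characters here.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces in front, one behind
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


fulltext = "Animal"
print(round(trigram_similarity("Animal Brown", fulltext), 3))  # 0.538 (7 of 13)
print(round(trigram_similarity("Animals", fulltext), 3))       # 0.667 (6 of 9)
print(trigram_similarity("Animal", fulltext))                  # 1.0
print(trigram_similarity("Brown", fulltext))                   # 0.0
```

Under these assumptions the concatenated string 'Animal Brown' scores lower than 'Animals', while the single tag 'Animal' scores 1.0, which is the ordering described above.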
You cannot break up the tags__name (at least I don't know a way).
From your examples, I can assume 2 possible solutions (1st solution is not strictly using Django):
Not everything needs to pass strictly through Django
We have Python powers, so let's use them:
Let us compose the query first:
from difflib import SequenceMatcher
from django.db.models import Q

THRESHOLD = 0.6  # placeholder value (an assumption); tune it for your data

def create_query(fulltext):
    illustration_names = Illustration.objects.values_list('name', flat=True)
    tag_names = Tag.objects.values_list('name', flat=True)
    query = []
    for name in illustration_names:
        score = SequenceMatcher(None, name, fulltext).ratio()
        if score == 1:
            # Perfect match for an illustration name
            return [Q(name=name)]
        if score >= THRESHOLD:
            query.append(Q(name=name))
    for name in tag_names:
        score = SequenceMatcher(None, name, fulltext).ratio()
        if score == 1:
            # Perfect match for a tag name
            return [Q(tags__name=name)]
        if score >= THRESHOLD:
            query.append(Q(tags__name=name))
    return query
Then to create your queryset:
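from functools import reduce  # Needed only in Python 3
from operator import or_

conditions = create_query(fulltext)
# reduce() raises a TypeError on an empty list, so handle "no match" explicitly
queryset = Illustration.objects.filter(reduce(or_, conditions)) if conditions else Illustration.objects.none()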
Decode the above:
- We are checking every Illustration and Tag name against our fulltext, and we are composing a query with every name whose similarity passes the THRESHOLD.
- The SequenceMatcher method compares sequences and returns a ratio 0 <= ratio <= 1, where 0 indicates No-Match and 1 indicates Perfect-Match. Check this answer for another usage example: Find the similarity percent between two strings (Note: There are other string comparison modules as well, find one that suits you).
- Q() Django objects allow the creation of complex queries (more on the linked docs).
- With operator and reduce we transform a list of Q() objects into an OR-separated query argument: Q(name=name_1) | Q(name=name_2) | ... | Q(tags__name=tag_name_1) | ...
Note:
- You need to define an acceptable THRESHOLD.
- As you can imagine, this will be a bit slow, but that is to be expected when you need to do a "fuzzy" search.
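To get a feel for which THRESHOLD is acceptable, you can try SequenceMatcher on plain strings before involving the database. This is only a sketch: similar_names is a hypothetical helper, and the 0.6 threshold and the sample names are assumptions taken from the example data in the question.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.6  # assumed value, tune it for your data


def similar_names(names, fulltext, threshold=THRESHOLD):
    """Return the names whose SequenceMatcher ratio against fulltext reaches the threshold."""
    return [
        name for name in names
        if SequenceMatcher(None, name, fulltext).ratio() >= threshold
    ]


tag_names = ["Animal", "Brown", "Animals"]
for name in tag_names:
    # Print each ratio so you can see where a sensible cut-off lies.
    print(name, round(SequenceMatcher(None, name, "Animal").ratio(), 3))

print(similar_names(tag_names, "Animal"))  # keeps "Animal" and "Animals"
```

Keep in mind that SequenceMatcher is case-sensitive, so you may want to lowercase both strings before comparing them.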
(The Django Way:)
Use a query with a high similarity threshold and order the queryset by this similarity rate:
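from django.contrib.postgres.search import TrigramSimilarity
from django.db.models.functions import Greatest

queryset.annotate(
    similarity=Greatest(
        TrigramSimilarity('name', fulltext),
        TrigramSimilarity('tags__name', fulltext)
    )
).filter(similarity__gte=threshold).order_by('-similarity')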
Decode the above:
- Greatest() accepts a list of expressions or model fields (not to be confused with the Django method aggregate) and returns the max item.
- TrigramSimilarity(word, search) returns a rate between 0 and 1. The closer the rate is to 1, the more similar the word is to search.
- .filter(similarity__gte=threshold) filters out similarities lower than the threshold.
- 0 < threshold < 1. You can set the threshold to 0.6, which is pretty high (consider that the default is 0.3). You can play around with that to tune your performance.
- .order_by('-similarity') orders the results by the similarity rate in descending order.
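The annotate / filter / order_by chain can be simulated roughly in pure Python on the example data from the question. This sketch assumes that the tags__name join yields one row per (illustration, tag) pair and uses a simplified approximation of trigram similarity. The names sim, rows and THRESHOLD are illustrative only and are not Django API.

```python
import re


def sim(a, b):
    """Approximate trigram similarity: shared trigrams / all distinct trigrams."""
    def grams(text):
        out = set()
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            padded = "  " + word + " "
            out.update(padded[i:i + 3] for i in range(len(padded) - 2))
        return out

    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga and gb else 0.0


fulltext = "Animal"
THRESHOLD = 0.6  # assumed value

# One row per (illustration name, tag name) pair, as a join would produce.
rows = [("Dog", "Animal"), ("Dog", "Brown"), ("Cat", "Animals")]

# Greatest(...): per row, keep the larger of the two similarities.
annotated = [(name, max(sim(name, fulltext), sim(tag, fulltext))) for name, tag in rows]
# .filter(similarity__gte=threshold): drop rows below the threshold.
kept = [row for row in annotated if row[1] >= THRESHOLD]
# .order_by('-similarity'): highest similarity first.
result = sorted(kept, key=lambda row: row[1], reverse=True)

print(result)  # "Dog" (1.0) comes before "Cat" (about 0.667)
```

With these assumptions the ("Dog", "Brown") row falls below the threshold and is dropped, so "Dog" is ranked above "Cat" through its exact "Animal" tag.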