
Django max similarity (TrigramSimilarity) from ManyToManyField

I have to implement a search function which will be fault tolerant.
Currently, I have the following situation:

Models:

class Tag(models.Model):
    name = models.CharField(max_length=255)

class Illustration(models.Model):
    name = models.CharField(max_length=255)
    tags = models.ManyToManyField(Tag)

Query:

queryset.annotate(similarity=TrigramSimilarity('name', fulltext) + TrigramSimilarity('tags__name', fulltext))

Example data:

Illustrations:

ID |  Name  |        Tags       |
---|--------|-------------------|
 1 | "Dog"  | "Animal", "Brown" |
 2 | "Cat"  | "Animals"         |

Illustration has Tags:

ID_Illustration | ID_Tag |
----------------|--------|
       1        |    1   |
       1        |    2   |
       2        |    3   |

Tags:

ID_Tag |   Name   |
-------|----------|
   1   |  Animal  |
   2   |  Brown   |
   3   |  Animals |

When I run the query with "Animal", the similarity for "Dog" should be higher than for "Cat", as it is a perfect match.
Unfortunately, both tags seem to be considered together.
It looks as if the tags are concatenated into a single string, which is then checked for similarity:

TrigramSimilarity("Animal Brown", "Animal") => X

Instead, I would like to get the highest similarity across an Illustration instance's name and each of its tags:

Max([
    TrigramSimilarity('Name', "Animal"), 
    TrigramSimilarity("Tag_1", "Animal"), 
    TrigramSimilarity("Tag_2", "Animal"),
]) => X

Edit1: I'm trying to query all Illustrations where either the name or one of the tags has a similarity greater than X.

Edit2: Additional example:

fulltext = 'Animal'

TrigramSimilarity('Animal Brown', fulltext) => x
TrigramSimilarity('Animals', fulltext) => y

Where x < y

But what I want is actually

TrigramSimilarity(Max(['Animal', 'Brown']), fulltext) => x (Similarity to Animal)
TrigramSimilarity('Animals', fulltext) => y

Where x > y

Lukas asked Feb 03 '18 23:02


1 Answer

You cannot break up tags__name into its individual tags (at least I don't know of a way).
From your examples, I can see 2 possible solutions (the 1st does not rely strictly on Django):


  1. Not everything needs to pass strictly through Django
    We have Python powers, so let's use them:

    Let us compose the query first:

    from difflib import SequenceMatcher

    from django.db.models import Q

    from .models import Illustration, Tag  # assumed path; adjust to your app

    THRESHOLD = 0.6  # example value; tune it for your data

    def create_query(fulltext):
        illustration_names = Illustration.objects.values_list('name', flat=True)
        tag_names = Tag.objects.values_list('name', flat=True)
        query = []

        for name in illustration_names:
            score = SequenceMatcher(None, name, fulltext).ratio()
            if score == 1:
                # Perfect match on an illustration name: nothing else is needed
                return [Q(name=name)]

            if score >= THRESHOLD:
                query.append(Q(name=name))

        for name in tag_names:
            score = SequenceMatcher(None, name, fulltext).ratio()
            if score == 1:
                # Perfect match on a tag name: nothing else is needed
                return [Q(tags__name=name)]

            if score >= THRESHOLD:
                query.append(Q(tags__name=name))

        return query
    

    Then to create your queryset:

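    from functools import reduce  # a builtin on Python 2, an import on Python 3
    from operator import or_
    conditions = create_query(fulltext)
    if conditions:
        # OR the Q objects; distinct() drops rows duplicated by the tags join
        queryset = Illustration.objects.filter(reduce(or_, conditions)).distinct()
    else:
        queryset = Illustration.objects.none()  # reduce() fails on an empty list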
    

    Decode the above:

    We check every Illustration and Tag name against our fulltext and compose a query from every name whose similarity reaches the THRESHOLD.

    • SequenceMatcher compares two sequences, and its ratio() method returns a value 0 <= ratio <= 1, where 0 indicates No-Match and 1 indicates Perfect-Match. Check this answer for another usage example: Find the similarity percent between two strings (Note: there are other string comparison modules as well; find one that suits you.)
    • Q() objects in Django allow the creation of complex queries (more on the linked docs).
    • With operator.or_ and reduce we transform a list of Q() objects into an OR-separated query argument:
      Q(name=name_1) | Q(name=name_2) | ... | Q(tags__name=tag_name_1) | ...

    Note: You need to define an acceptable THRESHOLD.
    This approach loads every illustration and tag name into Python and scores them one by one, so expect it to be slow on large tables; that is the cost of doing the "fuzzy" comparison outside the database.
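
    To get a feel for the scores before picking a THRESHOLD, here is a small standalone sketch of how SequenceMatcher.ratio() behaves on the tag names from the question. The helper name score, the lower-casing step and the 0.6 cut-off are illustrative assumptions rather than part of the code above; ratio() itself is case-sensitive.

```python
from difflib import SequenceMatcher


def score(candidate, fulltext):
    """Case-insensitive similarity in [0, 1] (hypothetical helper)."""
    return SequenceMatcher(None, candidate.lower(), fulltext.lower()).ratio()


tags = ["Animal", "Brown", "Animals"]
scores = {tag: score(tag, "Animal") for tag in tags}

# ratio() is 2 * matched characters / total characters of both strings,
# so "Animals" against "Animal" gives 2 * 6 / 13.
matching = [tag for tag, value in scores.items() if value >= 0.6]
```

    With these inputs "Animal" and "Animals" clear the example cut-off and "Brown" does not, which is the kind of check worth running on your own data before settling on a value.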


  2. (The Django Way:)
    Use a query with a high similarity threshold and order the queryset by this similarity rate:

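    from django.contrib.postgres.search import TrigramSimilarity
    from django.db.models.functions import Greatest

    queryset.annotate(
        similarity=Greatest(
            TrigramSimilarity('name', fulltext),
            TrigramSimilarity('tags__name', fulltext),
        )
    ).filter(similarity__gte=threshold).order_by('-similarity')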
    

    Decode the above:

    • Greatest() takes two or more expressions or model fields and returns the largest value for each row. It is a database function, not an aggregate, so do not confuse it with Django's aggregate(). Because tags__name joins the tag table, the comparison runs once per (illustration, tag) row, so an illustration with several tags can appear more than once in the result.
    • TrigramSimilarity(word, search) returns a score between 0 and 1. The closer the score is to 1, the more similar word is to search.
    • .filter(similarity__gte=threshold) drops rows whose similarity is lower than the threshold.
    • 0 < threshold < 1. You can set the threshold to 0.6, which is fairly high (consider that the default is 0.3). Experiment with it to tune your results.
    • Finally, .order_by('-similarity') sorts the queryset by similarity in descending order.
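
    To see why taking the maximum over the name and each tag ranks "Dog" above "Cat" for the question's data, the following is a rough pure-Python sketch of trigram similarity. It only approximates how PostgreSQL's pg_trgm scores ASCII strings (lower-case the text, pad each word with two leading spaces and one trailing space, then divide the shared trigrams by all distinct trigrams). The function names are hypothetical, and the database remains the source of truth for real scores.

```python
import re


def trigrams(text):
    """Set of 3-character slices, with each word padded as pg_trgm does."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_similarity(name, tags, fulltext):
    """Highest score among the illustration name and each separate tag."""
    return max(trigram_similarity(c, fulltext) for c in [name, *tags])


dog = best_similarity("Dog", ["Animal", "Brown"], "Animal")
cat = best_similarity("Cat", ["Animals"], "Animal")
concatenated = trigram_similarity("Animal Brown", "Animal")
```

    Under this approximation dog is a perfect score, cat is lower, and the concatenated string "Animal Brown" scores lower still, which matches the x < y behaviour described in the question.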
John Moutafis answered Sep 29 '22 12:09