Fuzzy text search in Python

1 Answer

{1} You can do this with Whoosh 2.7, which supports fuzzy search through the whoosh.qparser.FuzzyTermPlugin plugin:

whoosh.qparser.FuzzyTermPlugin lets you search for “fuzzy” terms, that is, terms that don’t have to match exactly. The fuzzy term will match any similar term within a certain number of “edits” (character insertions, deletions, substitutions, and/or transpositions – this is called the “Damerau-Levenshtein edit distance”).
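
To make the notion of an "edit" concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein distance (the "optimal string alignment" variant). It only illustrates the metric; it is not how Whoosh computes matches internally, and the function name edit_distance is my own.

```python
def edit_distance(a, b):
    """Restricted Damerau-Levenshtein distance between strings a and b.

    Insertions, deletions, substitutions and swaps of two adjacent
    characters each count as one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "ai" <-> "ia"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("mail", "mial"))       # 1 (one transposition)
print(edit_distance("letter", "letters"))  # 1 (one insertion)
```

So a term like "mial" is within one edit of "mail", which is why a fuzzy query can still find it.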

To add the fuzzy plugin:


parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())

Once you add the fuzzy plugin to the parser, you can specify a fuzzy term by adding a ~ followed by an optional maximum edit distance. If you don’t specify an edit distance, the default is 1.

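For example, the following “fuzzy” term queries match terms within one edit of “letter”, within two edits, and within two edits where the first three characters must match exactly (the number after the / is the prefix length):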


letter~
letter~2
letter~2/3
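
To spell out how the three parts of that syntax fit together, here is a small standalone sketch that splits a fuzzy term into its text, maximum edit distance and prefix length. The helper parse_fuzzy_term and its regular expression are hypothetical and written only for illustration; they are not Whoosh's own parser. The default distance of 1 comes from the description above, and a missing prefix is reported as None rather than guessing a default.

```python
import re

# term, "~", optional max edit distance, optional "/prefix length"
_FUZZY_RE = re.compile(r"^(?P<text>\w+)~(?P<maxdist>\d+)?(?:/(?P<prefix>\d+))?$")


def parse_fuzzy_term(token):
    """Return (text, maxdist, prefixlength) for a token such as 'letter~2/3'."""
    m = _FUZZY_RE.match(token)
    if m is None:
        raise ValueError("not a fuzzy term: %r" % token)
    # No number after "~" means the default edit distance of 1
    maxdist = int(m.group("maxdist")) if m.group("maxdist") else 1
    # No "/n" part: leave the prefix length unspecified
    prefix = int(m.group("prefix")) if m.group("prefix") else None
    return m.group("text"), maxdist, prefix


for token in ("letter~", "letter~2", "letter~2/3"):
    print(token, "->", parse_fuzzy_term(token))
```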

{2} To keep the words in order, use a whoosh.query.Phrase query, but replace the default phrase plugin with whoosh.qparser.SequencePlugin, which allows fuzzy terms inside a phrase:


"letter~ stamp~ mail~"

To replace the default phrase plugin with the sequence plugin:


parser = qparser.QueryParser("fieldname", my_index.schema)
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())

{3} To allow other words between the phrase terms, set the slop argument of the Phrase query to a larger number:


whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)

slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
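
As a rough illustration of what slop does, the sketch below checks whether a list of words occurs in order in a token list, with each word at most slop positions after the previous one. This is my reading of the description above (so slop=1 means the words must be adjacent), not Whoosh's actual matcher, and phrase_matches is a made-up helper name.

```python
def phrase_matches(tokens, words, slop=1):
    """True if `words` occur in order in `tokens`, each at most
    `slop` positions after the previous matched word."""

    def search(word_idx, prev_pos):
        if word_idx == len(words):
            return True  # every word has been placed
        if prev_pos is None:
            candidates = range(len(tokens))  # first word may start anywhere
        else:
            candidates = range(prev_pos + 1, min(len(tokens), prev_pos + slop + 1))
        return any(
            tokens[pos] == words[word_idx] and search(word_idx + 1, pos)
            for pos in candidates
        )

    return search(0, None)


tokens = "letter first stamp second mail third".split()
print(phrase_matches(tokens, ["letter", "stamp", "mail"], slop=1))  # False
print(phrase_matches(tokens, ["letter", "stamp", "mail"], slop=2))  # True
```

With slop=1 the phrase fails because "first" sits between "letter" and "stamp"; raising the slop lets the intervening words through.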

You can also set the slop directly in the query string, like this:


"letter~ stamp~ mail~"~10

{4} Overall solution:

{4.a} The indexer would look like this:


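import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT

schema = Schema(title=TEXT(stored=True), content=TEXT)
# create_in() expects the index directory to exist already
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)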
writer = ix.writer()
writer.add_document(title=u"First document", content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
writer.commit()

{4.b} The searcher would look like this:


from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

with ix.searcher() as searcher:
    parser = QueryParser(u"content", ix.schema)
    # Enable the term~n syntax
    parser.add_plugin(FuzzyTermPlugin())
    # Swap the phrase plugin for the sequence plugin so fuzzy terms work inside quotes
    parser.remove_plugin_class(PhrasePlugin)
    parser.add_plugin(SequencePlugin())
    query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
    results = searcher.search(query)
    print("nb of results =", len(results))
    for r in results:
        print(r)

That gives the result:

nb of results = 2
<Hit {'title': 'Sixth document'}>
<Hit {'title': 'Third document'}>
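
To see why only those two documents come back, here is a toy pure-Python model of the same query: every query word must fuzzy-match a token (edit distance of at most 2), in order, with at most 10 positions between consecutive matches. This is only a sketch under my own assumptions (simple \w+ tokenizing, restricted Damerau-Levenshtein distance, no prefix length); it does not use Whoosh and all function names are made up.

```python
import re


def edit_distance(a, b):
    """Restricted Damerau-Levenshtein distance (adjacent swaps cost 1)."""
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_sequence_match(text, words, maxdist=2, slop=10):
    """True if each word fuzzy-matches a token of text, in order and within slop."""
    tokens = re.findall(r"\w+", text.lower())

    def search(word_idx, prev_pos):
        if word_idx == len(words):
            return True
        start = 0 if prev_pos is None else prev_pos + 1
        stop = len(tokens) if prev_pos is None else min(len(tokens), prev_pos + slop + 1)
        return any(
            edit_distance(tokens[pos], words[word_idx]) <= maxdist
            and search(word_idx + 1, pos)
            for pos in range(start, stop)
        )

    return search(0, None)


docs = {
    "First document": "This is the first document we've added!",
    "Second document": "The second one is even more interesting!",
    "Third document": "letter first, stamp second, mail third",
    "Fourth document": "stamp first, mail third",
    "Fifth document": "letter first, mail third",
    "Sixth document": "letters first, stamps second, mial third wrong",
    "Seventh document": "stamp first, letters second, mail third",
}
matches = [title for title, content in docs.items()
           if fuzzy_sequence_match(content, ["letter", "stamp", "mail"])]
print(matches)  # ['Third document', 'Sixth document']
```

The fourth and fifth documents are missing one of the words, and the seventh has "stamp" before "letters", so the ordered match fails even though every word is present.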

{5} If you want fuzzy matching to be the default, so that you don't have to write word~n for every word in the query, you can initialize the QueryParser like this:


from whoosh.query import FuzzyTerm
parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)

Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default maximum edit distance of maxdist = 1. Subclass it if you want a larger edit distance, and pass the subclass as termclass:


class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        # Python 3 form; on Python 2 use super(MyFuzzyTerm, self).__init__(...)
        super().__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

parser = QueryParser(u"content", ix.schema, termclass=MyFuzzyTerm)

References:

whoosh.query.Phrase
Adding fuzzy term queries
Allowing complex phrase queries
class whoosh.query.FuzzyTerm
qparser module


answered Oct 10 '22 09:10

Assem

