I am wondering if there is a Python library that can conduct fuzzy text search. For example:
I have tried fuzzywuzzy, which did not solve my problem. Another library, Whoosh, looks powerful, but I could not find the proper function.
Fuzzy searches help you find relevant results even when the search terms are misspelled. To perform a fuzzy search, append a tilde (~) at the end of the search term. For example, the search term bank~ will return rows that contain tank, benk or banks.
Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same.
{1} You can do this in Whoosh 2.7. It supports fuzzy search once you add the plugin whoosh.qparser.FuzzyTermPlugin:
whoosh.qparser.FuzzyTermPlugin lets you search for “fuzzy” terms, that is, terms that don’t have to match exactly. The fuzzy term will match any similar term within a certain number of “edits” (character insertions, deletions, and/or transpositions – this is called the “Damerau-Levenshtein edit distance”).
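To make the notion of "edits" concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein distance (also known as optimal string alignment). The name edit_distance is made up for this illustration and is not part of Whoosh, and this is not how Whoosh evaluates fuzzy terms internally. Note that this textbook variant also counts a single-character substitution as one edit, which is an assumption that may not match how Whoosh counts edits.

```python
def edit_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Hypothetical helper for illustration only; not a Whoosh function.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = number of edits to turn a[:i] into b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution (assumed to cost 1)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(edit_distance("mail", "mial"))       # one transposition
print(edit_distance("letter", "letters"))  # one insertion
```

Both calls print 1, so under this definition "mial" and "letters" are each within one edit of the terms they misspell.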
To add the fuzzy plugin:
from whoosh import qparser
# my_index is an index you have already created or opened
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())
Once you add the fuzzy plugin to the parser, you can specify a fuzzy term by adding a ~ followed by an optional maximum edit distance. If you don’t specify an edit distance, the default is 1.
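For example, each of the following is a “fuzzy” term query. The first uses the default edit distance of 1, the second allows up to 2 edits, and the third allows up to 2 edits but requires the first 3 characters to match exactly (a prefix length of 3):
letter~
letter~2
letter~2/3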
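The three forms above can be read as "text, optional maximum distance, optional prefix length". The sketch below splits a term that way. parse_fuzzy_term and FUZZY_RE are hypothetical names invented here, not Whoosh APIs, and the fallback prefix length of 0 when none is written is an assumption of this sketch; only the default distance of 1 comes from the text above.

```python
import re

# Hypothetical pattern for "text~maxdist/prefix"; not taken from Whoosh.
FUZZY_RE = re.compile(r"^(?P<text>[^~]+)~(?P<maxdist>\d+)?(?:/(?P<prefix>\d+))?$")


def parse_fuzzy_term(term, default_maxdist=1, default_prefix=0):
    """Return (text, maxdist, prefixlength) for a term such as 'letter~2/3'.

    A term without a tilde is treated as exact: distance 0, prefix 0.
    default_prefix=0 is an assumption made for this illustration.
    """
    match = FUZZY_RE.match(term)
    if match is None:
        return term, 0, 0
    maxdist = match.group("maxdist")
    prefix = match.group("prefix")
    return (
        match.group("text"),
        int(maxdist) if maxdist else default_maxdist,
        int(prefix) if prefix else default_prefix,
    )


for term in ("letter~", "letter~2", "letter~2/3"):
    print(term, "->", parse_fuzzy_term(term))
```

In real code the parsing is done by FuzzyTermPlugin itself; the sketch only shows which number means what.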
{2} To keep the words in order, use a whoosh.query.Phrase query, but replace the default phrase plugin with whoosh.qparser.SequencePlugin, which allows you to use fuzzy terms inside a phrase:
"letter~ stamp~ mail~"
To replace the default phrase plugin with the sequence plugin:
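from whoosh import qparser
parser = qparser.QueryParser("fieldname", my_index.schema)
# drop the default phrase handling, then let quoted sequences contain fuzzy terms
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())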
{3} To allow other words in between, set the slop argument of your Phrase query to a larger number:
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
You can also set the slop in the query string itself, like this:
"letter~ stamp~ mail~"~10
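To illustrate what slop controls, here is a small pure-Python sketch of in-order phrase matching over a list of tokens. It assumes that slop bounds the position difference between consecutive phrase words, so that slop=1 means the words must be adjacent, in line with the description above. phrase_matches is a hypothetical helper, not Whoosh's matcher, and it compares words exactly rather than fuzzily.

```python
def phrase_matches(tokens, words, slop=1):
    """Return True if `words` occur in order in `tokens`, with each
    consecutive pair at most `slop` positions apart.

    Hypothetical illustration only; not Whoosh's implementation.
    """
    if not words:
        return False
    # Positions where the first phrase word occurs.
    candidates = [i for i, tok in enumerate(tokens) if tok == words[0]]
    for word in words[1:]:
        next_candidates = []
        for pos, tok in enumerate(tokens):
            if tok != word:
                continue
            # Keep this position if a match of the previous word lies
            # strictly before it and no more than `slop` positions away.
            if any(0 < pos - prev <= slop for prev in candidates):
                next_candidates.append(pos)
        candidates = next_candidates
    return bool(candidates)


tokens = "letter first stamp second mail third".split()
phrase = ["letter", "stamp", "mail"]
print(phrase_matches(tokens, phrase, slop=1))  # False: one word sits between each pair
print(phrase_matches(tokens, phrase, slop=2))  # True: a gap of one word is allowed
```

With a slop of 10, as in the query above, any reasonable number of intervening words is tolerated, but the order of the phrase words still matters.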
{4} Overall solution:
{4.a} The indexer would look like this:
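import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
schema = Schema(title=TEXT(stored=True), content=TEXT)
# create_in() expects the index directory to exist already
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = create_in("indexdir", schema)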
writer = ix.writer()
writer.add_document(title=u"First document", content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
writer.commit()
{4.b} The searcher would look like this:
from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin
with ix.searcher() as searcher:
    parser = QueryParser(u"content", ix.schema)
    # enable the word~n syntax
    parser.add_plugin(FuzzyTermPlugin())
    # swap the phrase plugin for the sequence plugin so fuzzy terms work inside quotes
    parser.remove_plugin_class(PhrasePlugin)
    parser.add_plugin(SequencePlugin())
    query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
    results = searcher.search(query)
    # print calls written so they behave the same on Python 2 and Python 3
    print("nb of results = %d" % len(results))
    for r in results:
        print(r)
That gives the result:
nb of results = 2
<Hit {'title': u'Sixth document'}>
<Hit {'title': u'Third document'}>
{5} If you want to make fuzzy search the default, without using the word~n syntax on each word of the query, you can initialize QueryParser like this:
from whoosh.query import FuzzyTerm
parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)
Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default edit distance of maxdist = 1. Subclass it if you want a larger edit distance, and pass the subclass as termclass instead:
class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        # this form works on Python 2 and 3; on Python 3 a bare super().__init__(...) also works
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)