
Fuzzy text search in Python

I am wondering whether there is a Python library that can perform fuzzy text search. For example:

  • I have three keywords "letter", "stamp", and "mail".
  • I would like a function that checks whether those three words occur within the same paragraph (or within a certain distance of each other, such as one page).
  • In addition, the words have to appear in that order. It is fine if other words appear between them.

I have tried fuzzywuzzy, which did not solve my problem. Another library, Whoosh, looks powerful, but I did not find the right function.

asked May 26 '15 04:05 by TTT



1 Answer

{1} You can do this in Whoosh 2.7. Fuzzy search is enabled by adding the plugin whoosh.qparser.FuzzyTermPlugin:

whoosh.qparser.FuzzyTermPlugin lets you search for “fuzzy” terms, that is, terms that don’t have to match exactly. The fuzzy term will match any similar term within a certain number of “edits” (character insertions, deletions, and/or transpositions – this is called the “Damerau-Levenshtein edit distance”).
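
To make the notion of "edits" concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment). The function name osa_distance is made up for this illustration, and this is not Whoosh's own implementation; it only shows what is being counted. Note that it also counts substitutions, which is an assumption on top of the quoted description.

```python
def osa_distance(a, b):
    """Count insertions, deletions, substitutions and adjacent
    transpositions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # two adjacent characters swapped counts as one edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("mail", "mial"))       # 1 (one transposition)
print(osa_distance("letter", "letters"))  # 1 (one insertion)
```

Under this measure, "mial" and "letters" are each within one edit of the corresponding keyword.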

To add the fuzzy plugin:

from whoosh import qparser

parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())

Once you add the fuzzy plugin to the parser, you can specify a fuzzy term by adding a ~ followed by an optional maximum edit distance. If you don’t specify an edit distance, the default is 1.

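For example, each of the following is a valid fuzzy term query; the third one also sets a prefix length of 3, meaning the first three characters must match exactly: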

letter~
letter~2
letter~2/3
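
As a rough illustration of how the three forms above break down, the sketch below splits a fuzzy term into its text, maximum edit distance and prefix length. The name parse_fuzzy_term and the regular expression are hypothetical and are not Whoosh's parser. The default distance of 1 comes from the text above; the default prefix length of 0 is an assumption made for this example.

```python
import re

# text, then "~", then an optional single-digit distance,
# then an optional "/<prefix length>"
FUZZY_RE = re.compile(
    r"^(?P<text>[^~\s]+)~(?P<maxdist>[0-9])?(?:/(?P<prefix>[0-9]+))?$"
)


def parse_fuzzy_term(token):
    """Return (text, maxdist, prefixlength) for a token such as 'letter~2/3'."""
    m = FUZZY_RE.match(token)
    if m is None:
        raise ValueError("not a fuzzy term: %r" % token)
    # a missing distance falls back to 1, as described above
    maxdist = int(m.group("maxdist")) if m.group("maxdist") else 1
    # a missing prefix length falls back to 0 (assumed for this sketch)
    prefix = int(m.group("prefix")) if m.group("prefix") else 0
    return m.group("text"), maxdist, prefix


for token in ("letter~", "letter~2", "letter~2/3"):
    print(token, "->", parse_fuzzy_term(token))
```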

{2} To keep the words in order, use the query class whoosh.query.Phrase, but replace the default phrase plugin with whoosh.qparser.SequencePlugin, which allows fuzzy terms inside a phrase:

"letter~ stamp~ mail~"

To replace the default phrase plugin with the sequence plugin:

parser = qparser.QueryParser("fieldname", my_index.schema)
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())

{3} To allow other words in between, set the slop argument of your Phrase query to a larger number:

whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)

slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.

You can also set the slop in the query string like this:

"letter~ stamp~ mail~"~10
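
To get a feel for what slop does, here is a simplified pure-Python model of an ordered match. It is only a sketch, not how Whoosh evaluates phrases: the helper in_order_within_slop is hypothetical, it does exact (not fuzzy) word comparison, and it assumes slop is the maximum difference in word position between consecutive matches, so that slop=1 means the words must be adjacent.

```python
import re


def in_order_within_slop(text, words, slop=1):
    """True if words occur in text in the given order, each one at most
    slop token positions after the previously matched word."""
    if not words:
        return True
    tokens = re.findall(r"\w+", text.lower())
    # positions of the first word are the possible starting points
    frontier = {i for i, t in enumerate(tokens) if t == words[0]}
    for word in words[1:]:
        positions = [i for i, t in enumerate(tokens) if t == word]
        # keep positions that follow an earlier match closely enough
        frontier = {p for p in positions
                    if any(0 < p - q <= slop for q in frontier)}
        if not frontier:
            return False
    return bool(frontier)


doc = "letter first, stamp second, mail third"
keywords = ["letter", "stamp", "mail"]
print(in_order_within_slop(doc, keywords, slop=10))  # True
print(in_order_within_slop(doc, keywords, slop=1))   # False
```

With slop=1 the example fails because "first" sits between "letter" and "stamp"; a larger slop tolerates the words in between.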

{4} Overall solution:

{4.a} The indexer would look like this:

import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT

schema = Schema(title=TEXT(stored=True), content=TEXT)
# create_in() expects the index directory to exist already
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title=u"First document", content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
# "mial" is misspelled on purpose to exercise the fuzzy match
writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
writer.commit()

{4.b} The searcher would look like this:

from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

with ix.searcher() as searcher:
    parser = QueryParser(u"content", ix.schema)
    # enable the word~n fuzzy syntax
    parser.add_plugin(FuzzyTermPlugin())
    # swap the phrase plugin for the sequence plugin so fuzzy terms work inside quotes
    parser.remove_plugin_class(PhrasePlugin)
    parser.add_plugin(SequencePlugin())
    # each word may differ by up to 2 edits; the trailing ~10 sets the slop
    query = parser.parse(u'"letter~2 stamp~2 mail~2"~10')
    results = searcher.search(query)
    print("nb of results = %d" % len(results))
    for r in results:
        print(r)

That gives the following result (shown as printed under Python 2; Python 3 omits the u prefix):

nb of results = 2
<Hit {'title': u'Sixth document'}>
<Hit {'title': u'Third document'}>

{5} If you want fuzzy search to be the default, without writing word~n for each word of the query, you can initialize QueryParser like this:

from whoosh.query import FuzzyTerm
parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)

Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default maximum edit distance of maxdist=1. Subclass it if you want a larger edit distance, and pass the subclass as termclass:

class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        # same as FuzzyTerm, but maxdist defaults to 2
        # (in Python 3 this can be shortened to super().__init__(...))
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

parser = QueryParser(u"content", ix.schema, termclass=MyFuzzyTerm)

References:

  1. whoosh.query.Phrase
  2. Adding fuzzy term queries
  3. Allowing complex phrase queries
  4. class whoosh.query.FuzzyTerm
  5. qparser module

answered Oct 10 '22 09:10 by Assem