I'm looking for a library or a method using existing libraries (difflib, fuzzywuzzy, python-levenshtein) to find the closest match of a string (query) in a text (corpus).
I've developed a method based on difflib, where I split my corpus into ngrams of size n (the length of query).
import difflib
from nltk.util import ngrams

def get_best_match(query, corpus):
    # Split the corpus into character ngrams of the same length as the query
    ngs = ngrams(list(corpus), len(query))
    ngrams_text = [''.join(x) for x in ngs]
    # Return the single closest ngram (no similarity cutoff)
    return difflib.get_close_matches(query, ngrams_text, n=1, cutoff=0)[0]
It works as I want when the difference between the query and the matched string consists only of character substitutions:
query = "ipsum dolor"
corpus = "lorem 1psum d0l0r sit amet"
match = get_best_match(query, corpus)
# match = "1psum d0l0r"
But when the difference involves character deletions, it does not:
query = "ipsum dolor"
corpus = "lorem 1psum dlr sit amet"
match = get_best_match(query, corpus)
# match = "psum dlr si"
# expected_match = "1psum dlr"
Is there a way to get a more flexible result size (as for expected_match)?
The workaround I use now is to extend the ngrams with (n-k)-grams for k = {1, 2, 3}, to account for up to 3 deletions. It is much better than the first version, but not efficient in terms of speed, as there are more than 3 times as many ngrams to check. It is also not a generalizable solution.
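For illustration, a minimal sketch of that workaround (the function name and the max_deletions parameter are mine, not from the post above):

import difflib
from nltk.util import ngrams

def get_best_match_nk(query, corpus, max_deletions=3):
    # Generate ngrams of sizes len(query) down to len(query) - max_deletions,
    # so candidates may be shorter than the query by up to max_deletions chars.
    candidates = []
    for k in range(max_deletions + 1):
        n = len(query) - k
        if n < 1:
            break
        candidates.extend(''.join(g) for g in ngrams(list(corpus), n))
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=0)
    return matches[0] if matches else None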
This function finds the best matching substring of variable length.
The implementation considers the corpus as one long string, hence avoiding your concerns with spaces and unseparated words.
Code summary:
1. Scan the corpus for match values in steps of size step to find the approximate location of the highest match value, pos.
2. Find the substring in the vicinity of pos with the highest match value, by adjusting the left/right positions of the substring.
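For reference, the match value used in both steps is difflib's SequenceMatcher.ratio(), which the _match() helper below wraps:

from difflib import SequenceMatcher

# ratio() returns 2*M/T, where M is the number of matching characters
# and T is the combined length of both strings; 1.0 is a perfect match.
print(SequenceMatcher(None, "ipsum dolor", "1psum d0l0r").ratio())  # ~0.727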
from difflib import SequenceMatcher

def get_best_match(query, corpus, step=4, flex=3, case_sensitive=False, verbose=False):
    """Return best matching substring of corpus.

    Parameters
    ----------
    query : str
    corpus : str
    step : int
        Step size of first match-value scan through corpus. Can be thought of
        as a sort of "scan resolution". Should not exceed length of query.
    flex : int
        Max. left/right substring position adjustment value. Should not
        exceed length of query / 2.

    Outputs
    -------
    output0 : str
        Best matching substring.
    output1 : float
        Match ratio of best matching substring. 1 is perfect match.
    """

    def _match(a, b):
        """Compact alias for SequenceMatcher."""
        return SequenceMatcher(None, a, b).ratio()

    def scan_corpus(step):
        """Return list of match values from corpus-wide scan."""
        match_values = []

        m = 0
        while m + qlen - step <= len(corpus):
            match_values.append(_match(query, corpus[m : m + qlen]))
            if verbose:
                print(query, "-", corpus[m : m + qlen], _match(query, corpus[m : m + qlen]))
            m += step

        return match_values

    def index_max(v):
        """Return index of max value."""
        return max(range(len(v)), key=v.__getitem__)

    def adjust_left_right_positions():
        """Return left/right positions for best string match."""
        # bp_* is synonym for 'Best Position Left/Right' and are adjusted
        # to optimize bmv_*
        p_l, bp_l = [pos] * 2
        p_r, bp_r = [pos + qlen] * 2

        # bmv_* are declared here in case they are untouched in optimization
        bmv_l = match_values[p_l // step]
        bmv_r = match_values[p_l // step]

        for f in range(flex):
            ll = _match(query, corpus[p_l - f : p_r])
            if ll > bmv_l:
                bmv_l = ll
                bp_l = p_l - f

            lr = _match(query, corpus[p_l + f : p_r])
            if lr > bmv_l:
                bmv_l = lr
                bp_l = p_l + f

            rl = _match(query, corpus[p_l : p_r - f])
            if rl > bmv_r:
                bmv_r = rl
                bp_r = p_r - f

            rr = _match(query, corpus[p_l : p_r + f])
            if rr > bmv_r:
                bmv_r = rr
                bp_r = p_r + f

            if verbose:
                print("\n" + str(f))
                print("ll: -- value: %f -- snippet: %s" % (ll, corpus[p_l - f : p_r]))
                print("lr: -- value: %f -- snippet: %s" % (lr, corpus[p_l + f : p_r]))
                print("rl: -- value: %f -- snippet: %s" % (rl, corpus[p_l : p_r - f]))
                print("rr: -- value: %f -- snippet: %s" % (rr, corpus[p_l : p_r + f]))

        return bp_l, bp_r, _match(query, corpus[bp_l : bp_r])

    if not case_sensitive:
        query = query.lower()
        corpus = corpus.lower()

    qlen = len(query)

    if flex >= qlen / 2:
        print("Warning: flex exceeds length of query / 2. Setting to default.")
        flex = 3

    match_values = scan_corpus(step)
    pos = index_max(match_values) * step

    pos_left, pos_right, match_value = adjust_left_right_positions()

    return corpus[pos_left : pos_right].strip(), match_value
Example:
query = "ipsum dolor"
corpus = "lorem i psum d0l0r sit amet"
match = get_best_match(query, corpus, step=2, flex=4)
print(match)
('i psum d0l0r', 0.782608695652174)
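That ratio is 2*M/T as noted above: M = 9 characters match between the 11-character query and the 12-character result, so 2*9/23 ≈ 0.783.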
Some good heuristic advice is to always keep step < len(query) * 3/4 and flex < len(query) / 3. I also added case sensitivity, in case that's important. It works quite well when you start playing with the step and flex values. Small step values give better results but take longer to compute. flex governs how flexible the length of the resulting substring is allowed to be.
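For example, a finer scan on the deletion case from the question (these parameter values are just an illustration):

query = "ipsum dolor"
corpus = "lorem 1psum dlr sit amet"
# Smaller step = finer (slower) scan; flex bounds how far the substring
# boundaries may move relative to the initial window of len(query).
match = get_best_match(query, corpus, step=2, flex=3)
print(match)  # best substring and its match ratio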
Important to note: This will only find the first best match, so if there are multiple equally good matches, only the first will be returned. To allow for multiple matches, change index_max() to return a list of indices for the n highest values of the input list, and loop over adjust_left_right_positions() for values in that list.
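A minimal sketch of that change (index_max_n is a hypothetical name; the commented loop assumes it runs inside get_best_match, where pos, step, and match_values are in scope):

import heapq

def index_max_n(v, n=3):
    # Hypothetical replacement for index_max(): indices of the n highest values.
    return heapq.nlargest(n, range(len(v)), key=v.__getitem__)

# Inside get_best_match, instead of taking a single pos:
# results = []
# for i in index_max_n(match_values, n=3):
#     pos = i * step  # adjust_left_right_positions() reads pos from this scope
#     results.append(adjust_left_right_positions())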