Is there a way to let difflib consider deletion in string matching?
I've tried the difflib.get_close_matches()
but it doesn't include shorter strings in the close-matches output. E.g.
from difflib import get_close_matches as gcm
x = """Erfreulich
Erfreuliche
Erfreulicher
Erfreulicherem
Erfreulicheres
Erfreulicherweis
Erfreulicherweise
Erfreuliches
Erfreulichste"""
x = x.split("\n")
for i in x:
    print(i, gcm(i, x))
Output:
Erfreulich ['Erfreulich', 'Erfreuliche', 'Erfreuliches']
Erfreuliche ['Erfreuliche', 'Erfreuliches', 'Erfreulicher']
Erfreulicher ['Erfreulicher', 'Erfreuliche', 'Erfreulicheres']
Erfreulicherem ['Erfreulicherem', 'Erfreulicheres', 'Erfreulicher']
Erfreulicheres ['Erfreulicheres', 'Erfreulicherweis', 'Erfreulicherem']
Erfreulicherweis ['Erfreulicherweis', 'Erfreulicherweise', 'Erfreulicheres']
Erfreulicherweise ['Erfreulicherweise', 'Erfreulicherweis', 'Erfreulicheres']
Erfreuliches ['Erfreuliches', 'Erfreuliche', 'Erfreulicheres']
Erfreulichste ['Erfreulichste', 'Erfreuliche', 'Erfreuliches']
Note that for the string Erfreulicher
, Erfreulich
isn't considered a close match although the distance is only -1.
get_close_matches() is a function in the difflib module of the Python standard library. difflib provides classes and functions for comparing sequences such as strings and lists; it can, for example, compare files and produce information about their differences in various formats, including HTML and context and unified diffs.
From the documentation, the n
parameter can be increased to get more matches. Some of the words are shorter, so difflib
does consider deletions.
difflib.get_close_matches(word, possibilities[, n][, cutoff])
Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings). Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.
The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.
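Both knobs from the docs matter here: raising n lets more of the candidates that already clear the cutoff through, while raising cutoff filters out less similar ones. A quick sketch using the word list from the question:

```python
from difflib import get_close_matches

words = ["Erfreulich", "Erfreuliche", "Erfreulicher", "Erfreulicherem",
         "Erfreulicheres", "Erfreulicherweis", "Erfreulicherweise",
         "Erfreulichste", "Erfreuliches"]

# With the default cutoff of 0.6, every word in this list scores high
# enough against "Erfreulicher"; only n limits how many are returned.
loose = get_close_matches("Erfreulicher", words, n=9, cutoff=0.6)
print(len(loose))  # all 9 words are returned

# With cutoff=0.9, only the words scoring at least 0.9 survive,
# and "Erfreulich" (score 0.909) still makes the list.
strict = get_close_matches("Erfreulicher", words, n=9, cutoff=0.9)
print("Erfreulich" in strict)           # True
print("Erfreulicherweise" in strict)    # False (score 0.828)
```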
Here is the same word with gcm(i,x,6)
:
Erfreulicher ['Erfreulicher', 'Erfreuliche', 'Erfreulicheres', 'Erfreulicherem',
'Erfreuliches', 'Erfreulich']
You should accept Mark Tolonen's answer - he read the docs ;-)
For a bit more insight, note that difflib
's notion of similarity has nothing to do with Levenshtein edit distance - but maybe that's what you really want. When you say:
Note that for the string Erfreulicher, Erfreulich isn't considered a close match although the distance is only -1.
I have no idea what notion of "distance" you have in mind either. The strings differ by 2 characters, right? "-1" is mysterious.
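If Levenshtein edit distance is what you actually want, it's easy to compute directly. Here is a minimal sketch (the levenshtein function is a hypothetical helper, not part of difflib) showing that deleting the trailing "er" makes the distance between these two strings 2:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Erfreulicher", "Erfreulich"))  # 2
```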
difflib
computes a "similarity score", which is a float in the range 0.0 through 1.0. Here's how to see what it's doing internally, using your list x
:
import difflib

# x is the list of words from the question
s = difflib.SequenceMatcher()
s.set_seq2("Erfreulicher")
full = []
for i in x:
    s.set_seq1(i)
    full.append((s.ratio(), i))
full.sort(reverse=True)
for score, i in full:
    print("{:20} {:.3f}".format(i, score))
Here's the result, sorted from highest similarity score to lowest:
Erfreulicher 1.000
Erfreuliche 0.957
Erfreulicheres 0.923
Erfreulicherem 0.923
Erfreuliches 0.917
Erfreulich 0.909
Erfreulichste 0.880
Erfreulicherweis 0.857
Erfreulicherweise 0.828
As the docs say, by default get_close_matches()
only returns the top 3. The specific word you're asking about happens to be sixth on the list, and would be returned if you told the function to return the top 6 (or 7, etc) matches (see Mark's answer).
How the score is computed is also documented. Since "Erfreulich" is a prefix of "Erfreulicher", it reduces to:
>>> 2.0 * len("Erfreulich") / (len("Erfreulich") + len("Erfreulicher"))
0.9090909090909091
All the strings above "Erfreulich" on the list have at least one more character in common, which makes the numerator larger. The denominator is also larger for them, but increasing the numerator by (say) 1 has a bigger effect on the result than increasing the denominator by 1. That may or may not match your intuition, but it is how it works ;-)
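To make that concrete, here is the same 2*M/T formula checked against SequenceMatcher for two of the words above (M is the number of matching characters, T the combined length):

```python
from difflib import SequenceMatcher

def score(a, b):
    return SequenceMatcher(None, a, b).ratio()

# "Erfreuliche" shares an 11-character prefix with "Erfreulicher":
# 2*11 / (12+11) = 22/23
print(score("Erfreulicher", "Erfreuliche"))  # 0.9565...

# "Erfreulich" shares only a 10-character prefix:
# 2*10 / (12+10) = 20/22
print(score("Erfreulicher", "Erfreulich"))   # 0.9090...
```

The extra matching character raises the numerator by 2 but the denominator by only 1, which is exactly why the longer common prefix wins.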