Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get close string matches considering deletion - python

Is there a way to let difflib consider deletion in string matching?

I've tried the difflib.get_close_matches() but it doesn't consider strings with lower length in the close matches output. E.g.

from difflib import get_close_matches as gcm

x = """Erfreulich
Erfreuliche
Erfreulicher
Erfreulicherem
Erfreulicheres
Erfreulicherweis
Erfreulicherweise
Erfreuliches
Erfreulichste"""

x = [i for i in x.split("\n")]

for i in x:
  print i, gcm(i,x)

Output:

Erfreulich ['Erfreulich', 'Erfreuliche', 'Erfreuliches']
Erfreuliche ['Erfreuliche', 'Erfreuliches', 'Erfreulicher']
Erfreulicher ['Erfreulicher', 'Erfreuliche', 'Erfreulicheres']
Erfreulicherem ['Erfreulicherem', 'Erfreulicheres', 'Erfreulicher']
Erfreulicheres ['Erfreulicheres', 'Erfreulicherweis', 'Erfreulicherem']
Erfreulicherweis ['Erfreulicherweis', 'Erfreulicherweise', 'Erfreulicheres']
Erfreulicherweise ['Erfreulicherweise', 'Erfreulicherweis', 'Erfreulicheres']
Erfreuliches ['Erfreuliches', 'Erfreuliche', 'Erfreulicheres']
Erfreulichste ['Erfreulichste', 'Erfreuliche', 'Erfreuliches']

Note that for the string Erfreulicher, Erfreulich isn't considered a close match although the distance is only -1.

like image 352
alvas Avatar asked Oct 28 '13 14:10

alvas


People also ask

What is get close matches in Python?

get_close_matches() is a function that is available in the difflib Python package. The difflib module provides classes and functions for comparing sequences. We can use this module to compare files and produce information about file differences in various formats.

How difflib works?

This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce information about file differences in various formats, including HTML and context and unified diffs.

Is Difflib standard Python?

This module in the python standard library provides classes and functions for comparing sequences like strings, lists etc.


2 Answers

From the documentation, the n parameter can be increased to get more matches. Some of the words are shorter, so difflib does consider deletions.

difflib.get_close_matches(word, possibilities[, n][, cutoff])
Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).

Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.

Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.

The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.

Here is the same word with gcm(i,x,6):

Erfreulicher ['Erfreulicher', 'Erfreuliche', 'Erfreulicheres', 'Erfreulicherem',
              'Erfreuliches', 'Erfreulich']
like image 104
Mark Tolonen Avatar answered Oct 02 '22 14:10

Mark Tolonen


You should accept Mark Tolonen's answer - he read the docs ;-)

For a bit more insight, note that difflib's notion of similarity has nothing to do with Levenshtein edit distance - but maybe that's what you really want. When you say:

Note that for the string Erfreulicher, Erfreulich isn't considered a close match although the distance is only -1.

I have no idea what notion of "distance" you have in mind either. The strings differ by 2 characters, right? "-1" is mysterious.

difflib computes a "similarity score", which is a float in the range 0.0 through 1.0. Here's how to see what it's doing internally, using your list x:

import difflib
s = difflib.SequenceMatcher()
s.set_seq2("Erfreulicher")
full = []
for i in x:
    s.set_seq1(i)
    full.append((s.ratio(), i))
full.sort(reverse=True)
for score, i in full:
    print "{:20} {:.3f}".format(i, score)

Here's the result, sorted from highest similarity score to lowest:

Erfreulicher         1.000
Erfreuliche          0.957
Erfreulicheres       0.923
Erfreulicherem       0.923
Erfreuliches         0.917
Erfreulich           0.909
Erfreulichste        0.880
Erfreulicherweis     0.857
Erfreulicherweise    0.828

As the docs say, by default get_close_matches() only returns the top 3. The specific word you're asking about happens to be sixth on the list, and would be returned if you told the function to return the top 6 (or 7, etc) matches (see Mark's answer).

How the score is computed is also documented. Since "Erfreulich" is a prefix of "Erfreulicher", it reduces to:

>>> 2.0 * len("Erfreulich") / (len("Erfreulich") + len("Erfreulicher"))
0.9090909090909091

All the strings above "Erfreulich" on the list have at least one more character in common, which makes the numerator larger. The denominator is also larger for them, but increasing the numerator by (say) 1 has a bigger effect on the result than increasing the denominator by 1. That may or may not match your intuition, but it is how it works ;-)

like image 28
Tim Peters Avatar answered Oct 02 '22 13:10

Tim Peters