Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Edit Distance in Python

I'm programming a spellcheck program in Python. I have a list of valid words (the dictionary) and I need to output a list of words from this dictionary that have an edit distance of 2 from a given invalid word.

I know I need to start by generating a list with an edit distance of one from the invalid word(and then run that again on all the generated words). I have three methods, inserts(...), deletions(...) and changes(...) that should output a list of words with an edit distance of 1, where inserts outputs all valid words with one more letter than the given word, deletions outputs all valid words with one less letter, and changes outputs all valid words with one different letter.

I've checked a bunch of places but I can't seem to find an algorithm that describes this process. All the ideas I've come up with involve looping through the dictionary list multiple times, which would be extremely time consuming. If anyone could offer some insight, I'd be extremely grateful.

like image 887
Mel Avatar asked Mar 17 '10 06:03

Mel


People also ask

What is edit distance with example?

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other.

What is edit distance in NLP?

Simply put, edit distance is a measurement of how many changes we must do to one string to transform it into the string we are comparing it to. As an illustration, the difference between “Frederic” and “Fred” is four, as we can change “Frederic” into “Fred” with the deletion of the letters “e” , “r”, “i” and ”c”.

What is edit distance problem?

The edit distance problem is the minimum number of insertions, deletions, or replacements required to convert one string to another. What is the time and space complexity of the dynamic programming approach? The time and space complexity of the dynamic programming approach is O(N * M)


2 Answers

The thing you are looking at is called an edit distance and here is a nice explanation on wiki. There are a lot of ways how to define a distance between the two words and the one that you want is called Levenshtein distance and here is a DP (dynamic programming) implementation in python.

def levenshteinDistance(s1, s2):     if len(s1) > len(s2):         s1, s2 = s2, s1      distances = range(len(s1) + 1)     for i2, c2 in enumerate(s2):         distances_ = [i2+1]         for i1, c1 in enumerate(s1):             if c1 == c2:                 distances_.append(distances[i1])             else:                 distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))         distances = distances_     return distances[-1] 

And a couple of more implementations are here.

like image 82
Salvador Dali Avatar answered Sep 18 '22 06:09

Salvador Dali


difflib in the standard library has various utilities for sequence matching, including the get_close_matches method that you could use. It uses an algorithm adapted from Ratcliff and Obershelp.

From the docs

>>> from difflib import get_close_matches >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy']) ['apple', 'ape'] 
like image 42
ryanjdillon Avatar answered Sep 21 '22 06:09

ryanjdillon