Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Edit distance such as Levenshtein taking into account proximity on keyboard

Tags:

Is there an edit distance such as Levenshtein which takes into account distance for substitutions?

For example, if we would consider if words are equal, typo and tylo are really close (p and l are physically close on the keyboard), while typo and tyqo are far apart. I'd like to allocate a smaller distance to more likely typos.

There must be a metric that takes this kind of promixity into account?

like image 260
PascalVKooten Avatar asked Mar 24 '15 13:03

PascalVKooten


People also ask

Is edit distance same as Levenshtein?

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.

How do you do Levenshtein distance?

The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.

What is Levenshtein distance Python?

Levenshtein distance is a lexical similarity measure which identifies the distance between one pair of strings. It does so by counting the number of times you would have to insert, delete or substitute a character from string 1 to make it like string 2.

What is edit distance matrix in NLP?

Simply put, edit distance is a measurement of how many changes we must do to one string to transform it into the string we are comparing it to. As an illustration, the difference between “Frederic” and “Fred” is four, as we can change “Frederic” into “Fred” with the deletion of the letters “e” , “r”, “i” and ”c”.


1 Answers

the kind of distance you ask is not included in levenshtein - but you should use a helper like euclidean or manhattan distance, to get the result.my simple assumption is, q (in english qwerty layout) is cartesian (y=0; x=0) so, w will be (y=0; x=1) and so on. whole list here

keyboard_cartesian= {                      'q': {'y': 0, 'x': 0},                      'w': {'y': 0, 'x': 1},                      'e': {'y': 0, 'x': 2},                         'r': {'y': 0, 'x': 3},                           # ...                      'a': {'y': 1, 'x': 0},                        #...                      'z': {'y': 2, 'x': 0},                      'x' : {'x':1, 'y':2},                       #                         } 

assume, word qaz has a meaning. levenshtein distance between qaz and with both of waz and eaz is 1. to check out which misspell is more likely, take the differences (here (q,w) and (q,e)) and calculate euclidean distance

>>> from math import * >>> def euclidean_distance(a,b): ...     X = (keyboard_cartesian[a]['x']-keyboard_cartesian[b]['x'])**2 ...     Y = (keyboard_cartesian[a]['y']-keyboard_cartesian[b]['y'])**2 ...     return sqrt(X+Y) ...  >>> euclidean_distance('q', 'w') 1.0  >>> euclidean_distance('q', 'e') 2.0 

this means misspell of qaz as waz is more likley than qaz as eaz.

like image 120
marmeladze Avatar answered Oct 04 '22 02:10

marmeladze