Is there an edit distance such as Levenshtein which takes into account distance for substitutions?
For example, if we would consider if words are equal, typo
and tylo
are really close (p
and l
are physically close on the keyboard), while typo
and tyqo
are far apart. I'd like to allocate a smaller distance to more likely typos.
There must be a metric that takes this kind of promixity into account?
Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.
The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.
Levenshtein distance is a lexical similarity measure which identifies the distance between one pair of strings. It does so by counting the number of times you would have to insert, delete or substitute a character from string 1 to make it like string 2.
Simply put, edit distance is a measurement of how many changes we must do to one string to transform it into the string we are comparing it to. As an illustration, the difference between “Frederic” and “Fred” is four, as we can change “Frederic” into “Fred” with the deletion of the letters “e” , “r”, “i” and ”c”.
the kind of distance you ask is not included in levenshtein - but you should use a helper like euclidean or manhattan distance, to get the result.my simple assumption is, q (in english qwerty layout) is cartesian (y=0; x=0) so, w will be (y=0; x=1) and so on. whole list here
keyboard_cartesian= { 'q': {'y': 0, 'x': 0}, 'w': {'y': 0, 'x': 1}, 'e': {'y': 0, 'x': 2}, 'r': {'y': 0, 'x': 3}, # ... 'a': {'y': 1, 'x': 0}, #... 'z': {'y': 2, 'x': 0}, 'x' : {'x':1, 'y':2}, # }
assume, word qaz has a meaning. levenshtein distance between qaz
and with both of waz
and eaz
is 1. to check out which misspell is more likely, take the differences (here (q,w) and (q,e)) and calculate euclidean distance
>>> from math import * >>> def euclidean_distance(a,b): ... X = (keyboard_cartesian[a]['x']-keyboard_cartesian[b]['x'])**2 ... Y = (keyboard_cartesian[a]['y']-keyboard_cartesian[b]['y'])**2 ... return sqrt(X+Y) ... >>> euclidean_distance('q', 'w') 1.0 >>> euclidean_distance('q', 'e') 2.0
this means misspell of qaz as waz is more likley than qaz as eaz.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With