I'm looking to compute the the Levenshtein-distance between sequences containing up to 6 values. The order of these values should not affect the distance.
How would I implement this into the iterative or recursive algorithm?
Example:
# Currently
>>> LDistance('dog', 'god')
2
# Sorted
>>> LDistance('dgo', 'dgo')
0
# Proposed
>>> newLDistance('dog', 'god')
0
'dog' and 'god' have the exact same letters, sorting the strings before hand will return the desired result. However this doesn't work all the time:
# Currently
>>> LDistance('doge', 'gold')
3
# Sorted
>>> LDistance('dego', 'dglo')
2
# Proposed
>>> newLDistance('doge', 'gold')
1
'doge' and 'gold' have 3/4 matching letters and so should return a distance of 1. Here is my current recursive code:
def mLD(s, t):
memo = {}
def ld(s, t):
if not s: return len(t)
if not t: return len(s)
if s[0] == t[0]: return ld(s[1:], t[1:])
if (s, t) not in memo:
l1 = ld(s, t[1:])
l2 = ld(s[1:], t)
l3 = ld(s[1:], t[1:])
memo[(s,t)] = 1 + min(l1, l2, l3)
return memo[(s,t)]
return ld(s, t)
EDIT: Followup question: Adding exceptions to Levenshtein-Distance-like algorithm
The order of these values should not affect the distance.
To quantify the similarity, we normalize the edit distance. One approach is to calculate edit distance as usual and then divide it by the number of operations, usually called the length of the edit path. This is called edit distance with post-normalization.
The Levenshtein distance (a.k.a edit distance) is a measure of similarity between two strings. It is defined as the minimum number of changes required to convert string a into string b (this is done by inserting, deleting or replacing a character in string a ).
Most commonly, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string and (iii) replace a character of a string by another character; for these operations, edit distance is sometimes known as Levenshtein distance .
You don't need the Levenshtein machinery for this.
import collections
def distance(s1, s2):
cnt = collections.Counter()
for c in s1:
cnt[c] += 1
for c in s2:
cnt[c] -= 1
return sum(abs(diff) for diff in cnt.values()) // 2 + \
(abs(sum(cnt.values())) + 1) // 2 # can be omitted if len(s1) == len(s2)
Why not just count how many letters are in common, and find and answer from this? For each character calculate its frequency, then for each string calculate how many "extra" characters it has based on frequencies, and take maximum of these "extra".
Pseudocode:
for c in s1:
cnt1[c]++
for c in s2:
cnt2[c]++
extra1 = 0
extra2 = 0
for c in all_chars:
if cnt1[c]>cnt2[c]
extra1 += cnt1[c]-cnt2[c]
else
extra2 += cnt2[c]-cnt1[c]
return max(extra1, extra2)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With