Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Modify Levenshtein-Distance to ignore order

I'm looking to compute the the Levenshtein-distance between sequences containing up to 6 values. The order of these values should not affect the distance.

How would I implement this into the iterative or recursive algorithm?

Example:

# Currently 
>>> LDistance('dog', 'god')
2

# Sorted
>>> LDistance('dgo', 'dgo')
0

# Proposed
>>> newLDistance('dog', 'god')
0

'dog' and 'god' have the exact same letters, sorting the strings before hand will return the desired result. However this doesn't work all the time:

# Currently 
>>> LDistance('doge', 'gold')
3

# Sorted
>>> LDistance('dego', 'dglo')
2

# Proposed
>>> newLDistance('doge', 'gold')
1

'doge' and 'gold' have 3/4 matching letters and so should return a distance of 1. Here is my current recursive code:

def mLD(s, t):
    memo = {}
    def ld(s, t):
        if not s: return len(t)
        if not t: return len(s)
        if s[0] == t[0]: return ld(s[1:], t[1:])
        if (s, t) not in memo:
            l1 = ld(s, t[1:])
            l2 = ld(s[1:], t)
            l3 = ld(s[1:], t[1:])
            memo[(s,t)] = 1 + min(l1, l2, l3)
        return memo[(s,t)]
    return ld(s, t)

EDIT: Followup question: Adding exceptions to Levenshtein-Distance-like algorithm

like image 923
Luis Avatar asked Sep 08 '15 11:09

Luis


People also ask

Does order matter in Levenshtein distance?

The order of these values should not affect the distance.

How do you normalize Levenshtein distance?

To quantify the similarity, we normalize the edit distance. One approach is to calculate edit distance as usual and then divide it by the number of operations, usually called the length of the edit path. This is called edit distance with post-normalization.

Is Levenshtein distance edit distance?

The Levenshtein distance (a.k.a edit distance) is a measure of similarity between two strings. It is defined as the minimum number of changes required to convert string a into string b (this is done by inserting, deleting or replacing a character in string a ).

What are three major operations of Levenshtein edit distance?

Most commonly, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string and (iii) replace a character of a string by another character; for these operations, edit distance is sometimes known as Levenshtein distance .


2 Answers

You don't need the Levenshtein machinery for this.

import collections
def distance(s1, s2):
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
        (abs(sum(cnt.values())) + 1) // 2   # can be omitted if len(s1) == len(s2)
like image 190
David Eisenstat Avatar answered Oct 11 '22 04:10

David Eisenstat


Why not just count how many letters are in common, and find and answer from this? For each character calculate its frequency, then for each string calculate how many "extra" characters it has based on frequencies, and take maximum of these "extra".

Pseudocode:

for c in s1:
    cnt1[c]++
for c in s2:
    cnt2[c]++
extra1 = 0
extra2 = 0
for c in all_chars:
    if cnt1[c]>cnt2[c]
        extra1 += cnt1[c]-cnt2[c]
    else
        extra2 += cnt2[c]-cnt1[c]
return max(extra1, extra2)
like image 21
Petr Avatar answered Oct 11 '22 04:10

Petr