Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find string similar to 2 other strings (in terms of Levenshtein distance)?

Let's say I have 2 strings which is pretty similar. I want to find other string which is close to s1 and s2 in terms of Levenshtein distance.

import Levenshtein
s1 = 'aaabbbccc'
s2 = 'abacbbbccde'
dist = Levenshtein.distance(s1, s2)
print(dist)
mid_str = get_avg_string(s1, s2)

What can be effective implementation of function:

def get_avg_string(s1, s2):
    return ''

I need that this variables:

sum_lev = Levenshtein.distance(s1, mid_str) + Levenshtein.distance(s2, mid_str)
diff_lev = abs(Levenshtein.distance(s1, mid_str) - Levenshtein.distance(s2, mid_str)

to be minimal (I think sum_lev will be equal to dist and diff_lev <= 1).

like image 995
ZFTurbo Avatar asked Mar 18 '21 08:03

ZFTurbo


People also ask

How do you find the similarity between two strings?

Hamming Distance, named after the American mathematician, is the simplest algorithm for calculating string similarity. It checks the similarity by comparing the changes in the number of positions between the two strings.

How is Levenshtein distance calculated?

The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.

How do you find the edit distance between two strings?

Delete 'm'th character of str1 and compute edit distance between 'm-1' characters of str1 and 'n' characters of str2. For this computation, we simply have to do - (1 + array[m-1][n]) where 1 is the cost of delete operation and array[m-1][n] is edit distance between 'm-1' characters of str1 and 'n' characters of str2.

Is edit distance and Levenshtein distance same?

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.

What is Levenshtein distance in Python?

Levenshtein distance is a lexical similarity measure which identifies the distance between one pair of strings. It does so by counting the number of times you would have to insert, delete or substitute a character from string 1 to make it like string 2.

How do you find the Levenshtein distance between two strings?

For example, if the first string a = 'abc' and the second string is b = 'abc', the Levenshtein distance between the two strings is 0 because a and b are equal. If a = 'abcd' and b = 'a', the distance is 3. If a = 'abcd' and b = 'aacc', the distance is 2.

What is Levenshtein distance in SQL Server?

The distance value describes the minimal number of deletions, insertions, or substitutions that are required to transform one string (the source) into another (the target). Unlike the Hamming distance, the Levenshtein distance works on strings with an unequal length.

What is the difference between Hamming and Levenshtein distance?

Unlike the Hamming distance, the Levenshtein distance works on strings with an unequal length. The greater the Levenshtein distance, the greater are the difference between the strings. For example, from "test" to "test" the Levenshtein distance is 0 because both the source and target strings are identical.


Video Answer


4 Answers

I'm afraid that what you ask for is not possible, since the problem is NP-hard. I will try to outline a few of the key concepts for why that is the case, but I'd encourage you to look up Center Strings and Steiner Strings.

Suppose that you have a set of strings called S. The optimal Steiner String is a string which minimizes the sum of the distances of each string in S to itself (also known as consensus error). This corresponds to the first property, which you called sum_lev. The optimal Steiner String is usually ambiguous and not part of the original set S (but doesn't have to be).

The problem you are facing is that there is no efficient way to compute the optimal Steiner String. Even if you manage to restrict your search space, you will still have an exponential amount of candidates. The problem is hence NP-hard.

It can be proven that S always contains a string which is a reasonable approximation of the optimal Steiner String. So even if you only pay attention to the first of your two properties, the best shot you have is to simply select one of your original strings. Since you are apparently only dealing with two strings, it should not matter which one you choose.

TL;DR

To summarize, you are dealing with an NP-hard problem which can not be solved efficiently, but only approximated. If you are only dealing with two strings, the string you are looking for can be approximated by one of the given strings. I'm sorry that this is probably not the answer you hoped for, but hopefully it was still somewhat helpful.

like image 88
Matze Avatar answered Oct 23 '22 18:10

Matze


Assumption

So first of all let's assume string1(lets call it s1) is the before and string2(lets call it s2) after transformation. This way we can easily separate add and remove character operations.

The example

Let's consider example given by you. Levenshtein.distance(s1='aaabbbccc', s2='abacbbbccde') This means we are asking question how many operations separete these string(How much it costs to mutate one into other).

Levenshtein matrix

Now that we assume s1 is the start point, let's see what the algorithm gives us.
We can calculate that distance between s1 and s2, and it spits out integer value of 4
It comes from the last value of the calculated Levenshtein matrix, like so:
enter image description here

Walk Levennshtein matrix

As we can see there are places where value goes up and where it stays the same.
If we go over the matrix from the top left corner, we should read it like:

  • going down means: adding a character to the s1 string
  • going right means: removing a character from the s1 string
  • going diagonally down-right means: replacing the character

Our goal is to get to the bottom-right corner and the result value is the cost(or distance) that is associated with it.

Distance change in matrix

Let's see how matrix will change its values when we change last value in s1 enter image description here As we can see previous intersection of cxd changed to dxd and now the cost does not rise in this place and that results in smaller distance between those strings.
What we see is that small changes in s1 will result in distance change of 1 and if we compare original s1 to the modified one:
enter image description here
It look preety damn close in term of Levenshtein distance.

Conclusion

There is potentially an algorithm to generate quite a lot of strings similar to s1 and s2.
It would go over generated matrix and change single character in a string to generate next solution.
We should consider multiple changes done to a original matrix. And then for each new solution we potentially want to calculate new Levenshtein matrix and use it as next source to generate solutions.
And we should never consider only lowering these values, that would generate only portion of potential solutions.

One thing that is important to consider. In term of Levenshtein distance it does not matter if we compare character a to b or a to c all that is important is that "Are those the same character?" if not we do not care about its value.

like image 32
Iluvatar Avatar answered Oct 23 '22 20:10

Iluvatar


A little expansion of @Matze's answer - when you have NP-hard problem to solve you could use genetic algorithm to find some solution that might be better then just taking one of strings in finite time (no guarantees that it would be best or even better then one of original strings)

like image 2
konserw Avatar answered Oct 23 '22 19:10

konserw


This is not a solution but something to start with; an immediate improvement would be on function eq_len_strings that is equalizing the strings to the right; also you can take sub-strings of s1 on s2 to create your first mid-string (since this helps reduce the Levenshtein distance) and then just fill the _ with any character on your alphabet as search_mid_string does.

Another improvement is avoid (contrary to what I do) to fill all blanks (_) maybe adding the empty string to your alphabet or considering the difference in length for both strings.

import Levenshtein

def eq_len_strings(s1, s2):
    if len(s1) < len(s2):
        s1 = s1.ljust(len(s1)+len(s2)-len(s1), '_')
    elif len(s1) > len(s2):
        s2 = s2.ljust(len(s2)+len(s1)-len(s2), '_')
    return s1, s2

def keep_eq_chars(s1, s2):
    s = ''
    for char1, char2 in zip(s1, s2):
        if char1 == '_' or char2 == '_' or char1 != char2:
            s += '_'
        else:
            s += char1
    return s

def search_mid_string(s1, s2):
    alph = set(list(s1+s2))
    s1e, s2e = eq_len_strings(s1, s2)
    # start string
    s_mid = list(keep_eq_chars(s1e, s2e))
    
    alternate = 0
    for i in range(len(s_mid)):
        char1 = s1[i] if i < len(s1)-1 else '_'
        char2 = s2[i] if i < len(s2)-1 else '_'
        if s_mid[i] == '_':
            if alternate == 0 and char1 != '_':
                s_mid[i] = char1
                alternate = 1
            elif alternate == 1 and char2 != '_':
                s_mid[i] = char2
                alternate = 0
            else:
                s_mid[i] = ''
    s1_to_s2_dist = Levenshtein.distance(s1, s2)
    s1_to_mid_dist = Levenshtein.distance(s1, ''.join(s_mid))
    s2_to_mid_dist = Levenshtein.distance(s2, ''.join(s_mid))
    ans = {
        's1_to_s2_dist': s1_to_s2_dist,
        's1_to_mid_dist': s1_to_mid_dist,
        's2_to_mid_dist': s2_to_mid_dist,
        's_mid': ''.join(s_mid)
    }
    return ans

With the strings given:

s1 = 'aaabbbccc'
s2 = 'abacbbbccde'

search_mid_string(s1, s2)

// Output
>{'s1_to_s2_dist': 4,
> 's1_to_mid_dist': 2,
> 's2_to_mid_dist': 3,
> 's_mid': 'aaacbbcccd'}
like image 2
Memristor Avatar answered Oct 23 '22 18:10

Memristor