Let's say I have 2 strings which is pretty similar. I want to find other string which is close to s1 and s2 in terms of Levenshtein distance.
import Levenshtein
s1 = 'aaabbbccc'
s2 = 'abacbbbccde'
dist = Levenshtein.distance(s1, s2)
print(dist)
mid_str = get_avg_string(s1, s2)
What can be effective implementation of function:
def get_avg_string(s1, s2):
return ''
I need that this variables:
sum_lev = Levenshtein.distance(s1, mid_str) + Levenshtein.distance(s2, mid_str)
diff_lev = abs(Levenshtein.distance(s1, mid_str) - Levenshtein.distance(s2, mid_str)
to be minimal (I think sum_lev
will be equal to dist
and diff_lev <= 1
).
Hamming Distance, named after the American mathematician, is the simplest algorithm for calculating string similarity. It checks the similarity by comparing the changes in the number of positions between the two strings.
The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.
Delete 'm'th character of str1 and compute edit distance between 'm-1' characters of str1 and 'n' characters of str2. For this computation, we simply have to do - (1 + array[m-1][n]) where 1 is the cost of delete operation and array[m-1][n] is edit distance between 'm-1' characters of str1 and 'n' characters of str2.
Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.
Levenshtein distance is a lexical similarity measure which identifies the distance between one pair of strings. It does so by counting the number of times you would have to insert, delete or substitute a character from string 1 to make it like string 2.
For example, if the first string a = 'abc' and the second string is b = 'abc', the Levenshtein distance between the two strings is 0 because a and b are equal. If a = 'abcd' and b = 'a', the distance is 3. If a = 'abcd' and b = 'aacc', the distance is 2.
The distance value describes the minimal number of deletions, insertions, or substitutions that are required to transform one string (the source) into another (the target). Unlike the Hamming distance, the Levenshtein distance works on strings with an unequal length.
Unlike the Hamming distance, the Levenshtein distance works on strings with an unequal length. The greater the Levenshtein distance, the greater are the difference between the strings. For example, from "test" to "test" the Levenshtein distance is 0 because both the source and target strings are identical.
I'm afraid that what you ask for is not possible, since the problem is NP-hard. I will try to outline a few of the key concepts for why that is the case, but I'd encourage you to look up Center Strings and Steiner Strings.
Suppose that you have a set of strings called S. The optimal Steiner String is a string which minimizes the sum of the distances of each string in S to itself (also known as consensus error). This corresponds to the first property, which you called sum_lev
. The optimal Steiner String is usually ambiguous and not part of the original set S (but doesn't have to be).
The problem you are facing is that there is no efficient way to compute the optimal Steiner String. Even if you manage to restrict your search space, you will still have an exponential amount of candidates. The problem is hence NP-hard.
It can be proven that S always contains a string which is a reasonable approximation of the optimal Steiner String. So even if you only pay attention to the first of your two properties, the best shot you have is to simply select one of your original strings. Since you are apparently only dealing with two strings, it should not matter which one you choose.
TL;DR
To summarize, you are dealing with an NP-hard problem which can not be solved efficiently, but only approximated. If you are only dealing with two strings, the string you are looking for can be approximated by one of the given strings. I'm sorry that this is probably not the answer you hoped for, but hopefully it was still somewhat helpful.
So first of all let's assume string1
(lets call it s1
) is the before and string2
(lets call it s2
) after transformation. This way we can easily separate add and remove character operations.
Let's consider example given by you.
Levenshtein.distance(s1='aaabbbccc', s2='abacbbbccde')
This means we are asking question how many operations separete these string(How much it costs to mutate one into other).
Now that we assume s1
is the start point, let's see what the algorithm gives us.
We can calculate that distance between s1
and s2
, and it spits out integer value of 4
It comes from the last value of the calculated Levenshtein matrix, like so:
As we can see there are places where value goes up and where it stays the same.
If we go over the matrix from the top left corner, we should read it like:
s1
strings1
stringOur goal is to get to the bottom-right corner and the result value is the cost(or distance) that is associated with it.
Let's see how matrix will change its values when we change last value in s1
As we can see previous intersection of c
xd
changed to d
xd
and now the cost does not rise in this place and that results in smaller distance between those strings.
What we see is that small changes in s1
will result in distance change of 1
and
if we compare original s1
to the modified one:
It look preety damn close in term of Levenshtein distance.
There is potentially an algorithm to generate quite a lot of strings similar to s1
and s2
.
It would go over generated matrix and change single character in a string to generate next solution.
We should consider multiple changes done to a original matrix. And then for each new solution we potentially want to calculate new Levenshtein matrix and use it as next source to generate solutions.
And we should never consider only lowering these values, that would generate only portion of potential solutions.
One thing that is important to consider. In term of Levenshtein distance it does not matter if we compare character a
to b
or a
to c
all that is important is that "Are those the same character?" if not we do not care about its value.
A little expansion of @Matze's answer - when you have NP-hard problem to solve you could use genetic algorithm to find some solution that might be better then just taking one of strings in finite time (no guarantees that it would be best or even better then one of original strings)
This is not a solution but something to start with; an immediate improvement would be on function eq_len_strings
that is equalizing the strings to the right; also you can take sub-strings of s1
on s2
to create your first mid-string (since this helps reduce the Levenshtein distance) and then just fill the _
with any character on your alphabet as search_mid_string
does.
Another improvement is avoid (contrary to what I do) to fill all blanks (_
) maybe adding the empty string to your alphabet or considering the difference in length for both strings.
import Levenshtein
def eq_len_strings(s1, s2):
if len(s1) < len(s2):
s1 = s1.ljust(len(s1)+len(s2)-len(s1), '_')
elif len(s1) > len(s2):
s2 = s2.ljust(len(s2)+len(s1)-len(s2), '_')
return s1, s2
def keep_eq_chars(s1, s2):
s = ''
for char1, char2 in zip(s1, s2):
if char1 == '_' or char2 == '_' or char1 != char2:
s += '_'
else:
s += char1
return s
def search_mid_string(s1, s2):
alph = set(list(s1+s2))
s1e, s2e = eq_len_strings(s1, s2)
# start string
s_mid = list(keep_eq_chars(s1e, s2e))
alternate = 0
for i in range(len(s_mid)):
char1 = s1[i] if i < len(s1)-1 else '_'
char2 = s2[i] if i < len(s2)-1 else '_'
if s_mid[i] == '_':
if alternate == 0 and char1 != '_':
s_mid[i] = char1
alternate = 1
elif alternate == 1 and char2 != '_':
s_mid[i] = char2
alternate = 0
else:
s_mid[i] = ''
s1_to_s2_dist = Levenshtein.distance(s1, s2)
s1_to_mid_dist = Levenshtein.distance(s1, ''.join(s_mid))
s2_to_mid_dist = Levenshtein.distance(s2, ''.join(s_mid))
ans = {
's1_to_s2_dist': s1_to_s2_dist,
's1_to_mid_dist': s1_to_mid_dist,
's2_to_mid_dist': s2_to_mid_dist,
's_mid': ''.join(s_mid)
}
return ans
With the strings given:
s1 = 'aaabbbccc'
s2 = 'abacbbbccde'
search_mid_string(s1, s2)
// Output
>{'s1_to_s2_dist': 4,
> 's1_to_mid_dist': 2,
> 's2_to_mid_dist': 3,
> 's_mid': 'aaacbbcccd'}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With