 

Is there a sparse edit distance algorithm?

Say you have two strings of length 100,000 containing zeros and ones. You can compute their edit distance in roughly 10^10 operations.

If each string only has 100 ones and the rest are zeros, then I can represent each string with 100 integers saying where the ones are.

Is there a much faster algorithm to compute the edit distance using this sparse representation? Even better would be an algorithm that also uses 100^2 space instead of 10^10 space.

To give something to test on, consider these two strings with 10 ones each. The integers say where the ones are in each string; a small sketch for expanding them into full strings follows the lists.

[9959, 10271, 12571, 21699, 29220, 39972, 70600, 72783, 81449, 83262]

[9958, 10270, 12570, 29221, 34480, 37952, 39973, 83263, 88129, 94336]
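(If you want to test against these, here is a quick sketch that expands the position lists into the full strings. I'm treating the positions as 0-indexed, which is an assumption on my part.)

#include <string>
#include <vector>

// Expand a sorted list of 1-positions into a dense 0/1 string of length n.
// Assumes the positions are 0-indexed.
std::string expand(const std::vector<int>& onePositions, int n) {
    std::string s(n, '0');
    for (int p : onePositions) s[p] = '1';
    return s;
}

// e.g. auto str1 = expand({9959, 10271, 12571, 21699, 29220, 39972,
//                          70600, 72783, 81449, 83262}, 100000);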

In algorithmic terms: if we have two sparse binary strings of length n, each represented by k integers, does there exist an O(k^2) time edit distance algorithm?

asked Aug 03 '18 by graffe



1 Answer

Of course! With that many 0s, very few operations actually matter. In fact, the answer is at most 200: at worst you substitute away the 100 ones of the first string and substitute in the 100 ones of the second.

Take a look at

10001010000000001
vs       ||||||
10111010100000010

Look at all the zeroes with pipes. Does it matter which one out of those you end up deleting? Nope. That's the key.


Solution 1

Let's consider the normal n*m solution:

dp(int i, int j) {
    // memo & base cases (i == 0 returns j; j == 0 returns i)
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1);
    }
    return 1 + min( dp(i-1, j), dp(i-1, j-1), dp(i, j-1) );
}

If almost every single character is a 0, what would hog the most time?

if( str1[i-1] == str2[j-1] ) { // They will be equal so many times, about 99900^2 times!
    return dp(i-1, j-1);
}

I could imagine that trickling down for tens of thousands of recursions. The only places that actually need real logic are the ~200 critical points, the indices where either string has a 1. Everything else can be skipped over. A simple modification would be

if( str1[i-1] == str2[j-1] ) {
    if( str1[i-1] == 1 )
        return dp(i-1, j-1); // Already hit a critical point

    // Rightmost location of a 1 in str1 that is <= i-1, and likewise for str2
    // (-1 if there is none)
    b1 = binarySearch(Str1CriticalPoints, i-1);
    b2 = binarySearch(Str2CriticalPoints, j-1);
    // Slide down the diagonal until one index lands on its critical point
    d = min(i-1 - b1, j-1 - b2);
    return dp(i - d, j - d);
    // Important! We land *on* a critical point rather than hopping past it,
    // because we still want to compute the answer there (that character can
    // be 1 in one string and 0 in the other).
}

Str1CriticalPoints and Str2CriticalPoints are the sorted arrays containing the index of every 1 in str1 and str2 respectively. Make sure they're sorted before you binary search. Keep in mind those gotchas. My logic was: okay, I need to make sure to calculate the answer at the critical point itself, so the jump should land on it rather than hop past it. But if the current characters are themselves 1s, d would be 0 and we'd get stuck in a loop. I'll handle that with a quick str1[i-1] == 1 check. Done, phew.

You can sanity-check the running time by noting that, in the worst case, you hit all 200*100000 combinations of i and j where one index sits on a critical point, and when those cells call min(a, b, c), they make only three recursive calls each. If one of those calls lands on a critical point, it is part of the 200*100000 we already counted and we can ignore it. If it doesn't, then in O(log(200)) it falls into a single call on another critical point (again, something we know is part of the 200*100000 we already counted). Thus, each critical point takes at worst 3*log(200) time excluding calls to other critical points. Similarly, the very first function call falls into a critical point in log(200) time. That gives an upper bound of O(200*100000*3*log(200) + log(200)).

Also, make sure your memo table is a hashmap, not an array: a 10^10-entry table will not fit in your computer's memory.
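To make Solution 1 concrete, here is a minimal self-contained C++ sketch. The names ones1/ones2, the per-string critical-point arrays, and the 64-bit key packing are my own choices, not the only way to do it:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

std::string str1, str2;                      // the dense strings
std::vector<int> ones1, ones2;               // sorted 0-indexed 1-positions
std::unordered_map<uint64_t, int> memo;      // hashmap memo, never a 10^10 array

// Rightmost 1-position <= pos, or -1 if there is none.
int rightmostOne(const std::vector<int>& ones, int pos) {
    auto it = std::upper_bound(ones.begin(), ones.end(), pos);
    return it == ones.begin() ? -1 : *std::prev(it);
}

int dp(int i, int j) {
    if (i == 0) return j;                    // insert all remaining characters
    if (j == 0) return i;                    // delete all remaining characters
    uint64_t key = (uint64_t(i) << 32) | uint64_t(j);
    auto it = memo.find(key);
    if (it != memo.end()) return it->second;

    int ans;
    if (str1[i-1] == str2[j-1]) {
        if (str1[i-1] == '1') {
            ans = dp(i-1, j-1);              // already at a critical point
        } else {
            // Both characters are 0: slide down the diagonal until one index
            // lands on its string's nearest critical point. d >= 1 here.
            int d = std::min(i-1 - rightmostOne(ones1, i-1),
                             j-1 - rightmostOne(ones2, j-1));
            ans = dp(i - d, j - d);
        }
    } else {
        ans = 1 + std::min({dp(i-1, j), dp(i-1, j-1), dp(i, j-1)});
    }
    memo[key] = ans;
    return ans;
}

The diagonal jump is what stops the recursion from walking through the zero runs one index at a time. One practical caveat: the mismatch branch can still recurse deeply (up to ~100,000 frames along a row or column), so in practice you may need a bigger stack or an iterative rewrite.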


Solution 2

You know the answer is at most 200, so just stop exploring any path whose indices have drifted too far apart to finish under that bound.

dp(int i, int j) { // O(100000 * 205), sounds good to me.
    if( abs(i - j) > 205 )
        return 205; // The true answer down here is more than 205, so any value
                    // >= the real answer works; it will never win the min().
    // memo & base case
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1);
    }
    return 1 + min( dp(i-1, j), dp(i-1, j-1), dp(i, j-1) );
}

This one is very simple, but I leave it for Solution 2 because it seems to come out of thin air, whereas Solution 1 came from analyzing the problem, figuring out where the slow part is, and optimizing it. Keep this one in your toolbox though, since this is the solution you should actually be coding.
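If you'd rather code it iteratively (this is my own rewrite of the same idea, not different in substance): run the usual row-by-row DP, but only fill the band of cells with |i - j| <= 205, treating everything outside the band as infinity. That is O(100000 * 205) time and O(100000) space:

#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

int bandedEditDistance(const std::string& a, const std::string& b, int W = 205) {
    const int INF = 1000000000;
    int n = a.size(), m = b.size();
    if (std::abs(n - m) > W) return W + 1;   // distance certainly exceeds the band
    std::vector<int> prev(m + 1, INF), cur(m + 1, INF);
    for (int j = 0; j <= std::min(m, W); ++j) prev[j] = j;  // row 0
    for (int i = 1; i <= n; ++i) {
        std::fill(cur.begin(), cur.end(), INF);
        if (i <= W) cur[0] = i;              // column 0, if still inside the band
        int lo = std::max(1, i - W), hi = std::min(m, i + W);
        for (int j = lo; j <= hi; ++j) {
            if (a[i-1] == b[j-1])
                cur[j] = prev[j-1];          // matching characters are free
            else
                cur[j] = 1 + std::min({prev[j], prev[j-1], cur[j-1]});
        }
        std::swap(prev, cur);
    }
    return prev[m];
}

The band plays the role of the abs(i - j) > 205 check; as the edit at the bottom notes, a width of 105 already suffices.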


Solution 3

Just like Solution 2, we could have done it like this:

dp(int i, int j, int threshold = 205) {
    if( threshold == 0 )
        return 205;
    // memo & base case
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1); // note: threshold resets to its default here
    }
    return 1 + min( dp(i-1, j, threshold - 1), dp(i-1, j-1, threshold - 1), dp(i, j-1, threshold - 1) );
}

You might be worried about dp(i-1, j-1) trickling down, but the threshold keeps i and j close together, so it calculates a subset of Solution 2. This is because the threshold gets decremented every time i and j drift farther apart. Writing dp(i-1, j-1, threshold) on the equal branch would make it essentially identical to Solution 2 (and slightly faster, since the budget would then never reset).


Space

These solutions will give you the answer very quickly, but if you want a space-optimized solution as well, it is easy to replace each str1[i] lookup with (i in Str1CriticalPoints) ? 1 : 0 using a hash set. That gives a final solution that is still very fast (though roughly 10x slower from the hashing) and also avoids keeping the long strings in memory at all, to the point where it could run on an Arduino. I don't think this is necessary, though.
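For concreteness, a tiny sketch of that lookup (the set name is mine):

#include <unordered_set>

// Read "character i of str1" straight from the sparse representation, so the
// dense 100,000-character string never has to exist in memory.
std::unordered_set<int> str1Ones;            // fill with the 1-positions of str1

char str1At(int i) {
    return str1Ones.count(i) ? '1' : '0';
}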

Note that the original solution does not use 10^10 space. You mention "even better, less than 10^10 space", with the implication that 10^10 space would be acceptable. Unfortunately, even with enough RAM, iterating through that space takes 10^10 time, which is definitely not acceptable. None of my solutions use 10^10 space: only 2 * 10^5 to hold the strings, and even that can be avoided, as discussed above. (10^10 bytes is 10 GB, for reference.)


EDIT: As maniek notes, you only need to check abs(i - j) > 105; the insertions needed to bring i and j back together would already pull the number of operations above 200.

answered Oct 07 '22 by Nicholas Pipitone