 

Is there a sparse edit distance algorithm?

Say you have two strings of length 100,000 containing zeros and ones. You can compute their edit distance in roughly 10^10 operations.

If each string only has 100 ones and the rest are zeros, then I can represent each string with 100 integers saying where the ones are.

Is there a much faster algorithm to compute the edit distance using this sparse representation? Even better would be an algorithm that also uses 100^2 space instead of 10^10 space.

To give something to test on, consider these two strings with 10 ones each. The integers say where the ones are in each string; a small sketch for expanding them into full strings follows the lists.

[9959, 10271, 12571, 21699, 29220, 39972, 70600, 72783, 81449, 83262]

[9958, 10270, 12570, 29221, 34480, 37952, 39973, 83263, 88129, 94336]
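(If you want to test against these, here is a quick sketch that expands the position lists into the full strings. I'm treating the positions as 0-indexed, which is an assumption on my part.)

#include <string>
#include <vector>

// Expand a sorted list of 1-positions into a dense 0/1 string of length n.
// Assumes the positions are 0-indexed.
std::string expand(const std::vector<int>& onePositions, int n) {
    std::string s(n, '0');
    for (int p : onePositions) s[p] = '1';
    return s;
}

// e.g. auto str1 = expand({9959, 10271, 12571, 21699, 29220, 39972,
//                          70600, 72783, 81449, 83262}, 100000);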

In algorithmic terms: if we have two sparse binary strings of length n, each represented by k integers, does there exist an O(k^2) time edit distance algorithm?

asked Aug 03 '18 by graffe



1 Answer

Of course! With that many 0s, very few operations actually matter. In fact, the answer is at most 200: at worst you substitute away the 100 ones of the first string and substitute in the 100 ones of the second.

Take a look at

10001010000000001
vs       ||||||
10111010100000010

Look at all the zeroes with pipes. Does it matter which one out of those you end up deleting? Nope. That's the key.


Solution 1

Let's consider the normal n*m solution:

dp(int i, int j) {
    // memo & base cases (i == 0 returns j; j == 0 returns i)
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1);
    }
    return 1 + min( dp(i-1, j), dp(i-1, j-1), dp(i, j-1) );
}

If almost every single character is a 0, what would hog the most time?

if( str1[i-1] == str2[j-1] ) { // They will be equal so many times, about 99900^2 times!
    return dp(i-1, j-1);
}

I could imagine that trickling down for tens of thousands of recursions. The only places that actually need real logic are the ~200 critical points, the indices where either string has a 1. Everything else can be skipped over. A simple modification would be

if( str1[i-1] == str2[j-1] ) {
    if( str1[i-1] == 1 )
        return dp(i-1, j-1); // Already hit a critical point

    // Rightmost location of a 1 in str1 that is <= i-1, and likewise for str2
    // (-1 if there is none)
    b1 = binarySearch(Str1CriticalPoints, i-1);
    b2 = binarySearch(Str2CriticalPoints, j-1);
    // Slide down the diagonal until one index lands on its critical point
    d = min(i-1 - b1, j-1 - b2);
    return dp(i - d, j - d);
    // Important! We land *on* a critical point rather than hopping past it,
    // because we still want to compute the answer there (that character can
    // be 1 in one string and 0 in the other).
}

Str1CriticalPoints and Str2CriticalPoints are the sorted arrays containing the index of every 1 in str1 and str2 respectively. Make sure they're sorted before you binary search. Keep in mind those gotchas. My logic was: okay, I need to make sure to calculate the answer at the critical point itself, so the jump should land on it rather than hop past it. But if the current characters are themselves 1s, d would be 0 and we'd get stuck in a loop. I'll handle that with a quick str1[i-1] == 1 check. Done, phew.

You can sanity-check the running time by noting that, in the worst case, you hit all 200*100000 combinations of i and j where one index sits on a critical point, and when those cells call min(a, b, c), they make only three recursive calls each. If one of those calls lands on a critical point, it is part of the 200*100000 we already counted and we can ignore it. If it doesn't, then in O(log(200)) it falls into a single call on another critical point (again, something we know is part of the 200*100000 we already counted). Thus, each critical point takes at worst 3*log(200) time excluding calls to other critical points. Similarly, the very first function call falls into a critical point in log(200) time. That gives an upper bound of O(200*100000*3*log(200) + log(200)).

Also, make sure your memo table is a hashmap, not an array: a 10^10-entry table will not fit in your computer's memory.
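To make Solution 1 concrete, here is a minimal self-contained C++ sketch. The names ones1/ones2, the per-string critical-point arrays, and the 64-bit key packing are my own choices, not the only way to do it:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

std::string str1, str2;                      // the dense strings
std::vector<int> ones1, ones2;               // sorted 0-indexed 1-positions
std::unordered_map<uint64_t, int> memo;      // hashmap memo, never a 10^10 array

// Rightmost 1-position <= pos, or -1 if there is none.
int rightmostOne(const std::vector<int>& ones, int pos) {
    auto it = std::upper_bound(ones.begin(), ones.end(), pos);
    return it == ones.begin() ? -1 : *std::prev(it);
}

int dp(int i, int j) {
    if (i == 0) return j;                    // insert all remaining characters
    if (j == 0) return i;                    // delete all remaining characters
    uint64_t key = (uint64_t(i) << 32) | uint64_t(j);
    auto it = memo.find(key);
    if (it != memo.end()) return it->second;

    int ans;
    if (str1[i-1] == str2[j-1]) {
        if (str1[i-1] == '1') {
            ans = dp(i-1, j-1);              // already at a critical point
        } else {
            // Both characters are 0: slide down the diagonal until one index
            // lands on its string's nearest critical point. d >= 1 here.
            int d = std::min(i-1 - rightmostOne(ones1, i-1),
                             j-1 - rightmostOne(ones2, j-1));
            ans = dp(i - d, j - d);
        }
    } else {
        ans = 1 + std::min({dp(i-1, j), dp(i-1, j-1), dp(i, j-1)});
    }
    memo[key] = ans;
    return ans;
}

The diagonal jump is what stops the recursion from walking through the zero runs one index at a time. One practical caveat: the mismatch branch can still recurse deeply (up to ~100,000 frames along a row or column), so in practice you may need a bigger stack or an iterative rewrite.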


Solution 2

You know the answer is at most 200, so just stop exploring any path whose indices have drifted too far apart to finish under that bound.

dp(int i, int j) { // O(100000 * 205), sounds good to me.
    if( abs(i - j) > 205 )
        return 205; // The true answer down here is more than 205, so any value
                    // >= the real answer works; it will never win the min().
    // memo & base case
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1);
    }
    return 1 + min( dp(i-1, j), dp(i-1, j-1), dp(i, j-1) );
}

This one is very simple, but I leave it for Solution 2 because it seems to come out of thin air, whereas Solution 1 came from analyzing the problem, figuring out where the slow part is, and optimizing it. Keep this one in your toolbox though, since this is the solution you should actually be coding.
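If you'd rather code it iteratively (this is my own rewrite of the same idea, not different in substance): run the usual row-by-row DP, but only fill the band of cells with |i - j| <= 205, treating everything outside the band as infinity. That is O(100000 * 205) time and O(100000) space:

#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

int bandedEditDistance(const std::string& a, const std::string& b, int W = 205) {
    const int INF = 1000000000;
    int n = a.size(), m = b.size();
    if (std::abs(n - m) > W) return W + 1;   // distance certainly exceeds the band
    std::vector<int> prev(m + 1, INF), cur(m + 1, INF);
    for (int j = 0; j <= std::min(m, W); ++j) prev[j] = j;  // row 0
    for (int i = 1; i <= n; ++i) {
        std::fill(cur.begin(), cur.end(), INF);
        if (i <= W) cur[0] = i;              // column 0, if still inside the band
        int lo = std::max(1, i - W), hi = std::min(m, i + W);
        for (int j = lo; j <= hi; ++j) {
            if (a[i-1] == b[j-1])
                cur[j] = prev[j-1];          // matching characters are free
            else
                cur[j] = 1 + std::min({prev[j], prev[j-1], cur[j-1]});
        }
        std::swap(prev, cur);
    }
    return prev[m];
}

The band plays the role of the abs(i - j) > 205 check; as the edit at the bottom notes, a width of 105 already suffices.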


Solution 3

Just like Solution 2, we could have done it like this:

dp(int i, int j, int threshold = 205) {
    if( threshold == 0 )
        return 205;
    // memo & base case
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1); // note: threshold resets to its default here
    }
    return 1 + min( dp(i-1, j, threshold - 1), dp(i-1, j-1, threshold - 1), dp(i, j-1, threshold - 1) );
}

You might be worried about dp(i-1, j-1) trickling down, but the threshold keeps i and j close together, so it calculates a subset of Solution 2. This is because the threshold gets decremented every time i and j drift farther apart. Writing dp(i-1, j-1, threshold) on the equal branch would make it essentially identical to Solution 2 (and slightly faster, since the budget would then never reset).


Space

These solutions will give you the answer very quickly, but if you want a space-optimized solution as well, it is easy to replace each str1[i] lookup with (i in Str1CriticalPoints) ? 1 : 0 using a hash set. That gives a final solution that is still very fast (though roughly 10x slower from the hashing) and also avoids keeping the long strings in memory at all, to the point where it could run on an Arduino. I don't think this is necessary, though.
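For concreteness, a tiny sketch of that lookup (the set name is mine):

#include <unordered_set>

// Read "character i of str1" straight from the sparse representation, so the
// dense 100,000-character string never has to exist in memory.
std::unordered_set<int> str1Ones;            // fill with the 1-positions of str1

char str1At(int i) {
    return str1Ones.count(i) ? '1' : '0';
}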

Note that the original solution does not use 10^10 space. You mention "even better, less than 10^10 space", with the implication that 10^10 space would be acceptable. Unfortunately, even with enough RAM, iterating through that space takes 10^10 time, which is definitely not acceptable. None of my solutions use 10^10 space: only 2 * 10^5 to hold the strings, and even that can be avoided, as discussed above. (10^10 bytes is 10 GB, for reference.)


EDIT: As maniek notes, you only need to check abs(i - j) > 105; the insertions needed to bring i and j back together would already pull the number of operations above 200.

answered Oct 07 '22 by Nicholas Pipitone