Inspired by two questions, "String manipulation: calculate the similarity of a string with its suffixes" and "Program execution varies as the I/P size increases beyond 5 in C", I came up with the algorithm below.
A bit of context first. For two strings, define their similarity as the length of their longest common prefix. The total self-similarity of a string s is the sum of the similarities of s with all of its suffixes. So, for example, the total self-similarity of abacab is 6 + 0 + 1 + 0 + 2 + 0 = 9, and the total self-similarity of a single character repeated n times is n*(n+1)/2.
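As a sanity check, the definition can be computed directly by brute force. A short Python sketch (helper names are mine, not from the code below):

```python
def lcp_len(a, b):
    # Length of the longest common prefix of a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def total_self_similarity(s):
    # Sum of the similarities of s with each of its suffixes.
    return sum(lcp_len(s, s[i:]) for i in range(len(s)))
```

This is O(n^2) in the worst case, but it reproduces the two examples above: 9 for abacab and n*(n+1)/2 for a run of identical characters.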
Description of the algorithm: The algorithm is based on the Knuth-Morris-Pratt string searching algorithm, in that the borders of the string's prefixes play the central role.
To recapitulate: a border of a string s is a proper substring b of s which is simultaneously a prefix and a suffix of s.
Remark: If b and c are borders of s with b shorter than c, then b is also a border of c, and conversely, every border of c is also a border of s.
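The remark is easy to check with a brute-force enumeration of borders (a sketch for illustration only, not part of the algorithm):

```python
def all_borders(s):
    # Every proper prefix of s that is also a suffix, shortest first.
    # (The empty string is a border of every non-empty string.)
    return [s[:k] for k in range(len(s)) if s[:k] == s[len(s) - k:]]
```

For example, the borders of abacaba are '', 'a' and 'aba', and every border of the border 'aba' is again a border of abacaba, as the remark states.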
Let s be a string of length n and let p be a prefix of s of length i. We call a border b of width k of p non-extensible if either i == n or s[i] != s[k]; otherwise it is extensible (the length-(k+1) prefix of s is then a border of the length-(i+1) prefix of s).
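In this terminology, the extensibility test is a one-liner. A minimal sketch (hypothetical helper names, 0-based indices as in the C code):

```python
def is_border(p, b):
    # b is a border of p: a proper prefix that is also a suffix.
    return len(b) < len(p) and p.startswith(b) and p.endswith(b)

def is_extensible(s, i, k):
    # A width-k border of the length-i prefix of s is extensible
    # iff we are not at the end of s and the next characters match.
    return i < len(s) and s[i] == s[k]
```

For s = abacab, the width-1 border of the length-5 prefix abaca is extensible (s[5] == s[1] == 'b'), while the width-1 border of the length-3 prefix aba is not (s[3] == 'c' != s[1] == 'b').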
Now, if the longest common prefix of s and the suffix starting at s[i], i > 0, has length k, then the length-k prefix of s is a non-extensible border of the length-(i+k) prefix of s. It is a border because it is a common prefix of s and s[i .. n-1], and if it were extensible, it would not be the longest common prefix. Conversely, every non-extensible border (of length k) of the length-i prefix of s is the longest common prefix of s and the suffix starting at s[i-k].
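This correspondence can be cross-checked by brute force: summing the widths of all non-empty non-extensible borders over every prefix, plus n for the similarity of s with itself, should reproduce the total self-similarity. A quadratic Python sketch of that check:

```python
def ne_border_sum(s):
    # n for the similarity of s with itself, plus the width of every
    # non-empty non-extensible border of every prefix of s.
    n = len(s)
    total = n
    for i in range(1, n + 1):
        p = s[:i]
        for k in range(1, i):
            if p[:k] == p[i - k:]:              # width-k border of the i-prefix
                if i == n or s[i] != s[k]:      # non-extensible
                    total += k
    return total
```

On abacab this gives 6 + 1 + 2 = 9, matching the direct suffix-by-suffix computation.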
So we can calculate the total self-similarity of s by adding n (the similarity of s with itself) and the lengths of all non-extensible borders of the length-i prefixes of s, 1 <= i <= n. To do that, for each 1 <= i <= n: if p = s[0 .. i-1] has a non-empty non-extensible border, let b be the widest of these; add the width of b, and for every non-empty border c of b, if c is also a non-extensible border of p, add its length too.

Code (C):
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/*
 * Overflow and NULL checks omitted to not clutter the algorithm.
 */

int similarity(char *text){
    int *borders, *ne_borders, len = strlen(text), i, j, sim;
    borders = malloc((len+1)*sizeof(*borders));
    ne_borders = malloc((len+1)*sizeof(*ne_borders));
    i = 0;
    j = -1;
    borders[i] = j;
    ne_borders[i] = j;
    /*
     * Find the length of the widest borders of prefixes of text,
     * standard KMP way, O(len).
     */
    while(i < len){
        while(j >= 0 && text[i] != text[j]){
            j = borders[j];
        }
        ++i, ++j;
        borders[i] = j;
    }
    /*
     * For each prefix, find the length of its widest non-extensible
     * border, this part is also O(len).
     */
    for(i = 1; i <= len; ++i){
        j = borders[i];
        /*
         * If the widest border of the i-prefix has width j and is
         * extensible (text[i] == text[j]), the widest non-extensible
         * border of the i-prefix is the widest non-extensible border
         * of the j-prefix.
         */
        if (text[i] == text[j]){
            j = ne_borders[j];
        }
        ne_borders[i] = j;
    }
    /* The longest common prefix of text and text is text. */
    sim = len;
    for(i = len; i > 0; --i){
        /*
         * If a longest common prefix of text and one of its suffixes
         * ends right before text[i], it is a non-extensible border of
         * the i-prefix of text, and conversely, every non-extensible
         * border of the i-prefix is a longest common prefix of text
         * and one of its suffixes.
         *
         * So, if the i-prefix has any non-extensible border, we must
         * sum the lengths of all these. Starting from the widest
         * non-extensible border, we must check all of its non-empty
         * borders for extensibility.
         *
         * Can this introduce nonlinearity? How many extensible borders
         * shorter than the widest non-extensible border can a prefix have?
         */
        if ((j = ne_borders[i]) > 0){
            sim += j;
            while(j > 0){
                j = borders[j];
                if (text[i] != text[j]){
                    sim += j;
                }
            }
        }
    }
    free(borders);
    free(ne_borders);
    return sim;
}

/* The naive algorithm for comparison */
int common_prefix(char *text, char *suffix){
    int c = 0;
    while(*suffix && *suffix++ == *text++) ++c;
    return c;
}

int naive_similarity(char *text){
    int len = (int)strlen(text);
    int i, sim = 0;
    for(i = 0; i < len; ++i){
        sim += common_prefix(text,text+i);
    }
    return sim;
}

int main(int argc, char *argv[]){
    int i;
    for(i = 1; i < argc; ++i){
        printf("%d\n",similarity(argv[i]));
    }
    for(i = 1; i < argc; ++i){
        printf("%d\n",naive_similarity(argv[i]));
    }
    return EXIT_SUCCESS;
}
So, is this correct? I'd be rather surprised if not, but I've been wrong before.
What is the worst case complexity of the algorithm?
I think it's O(n), but I haven't yet found a proof that the number of extensible borders contained in a prefix's widest non-extensible border is bounded (or rather, that the total number of such occurrences over all prefixes is O(n)).
I'm most interested in sharp bounds, but if you can prove that it's e.g. O(n*log n) or O(n^(1+x)) for small x
, that's already good. (It's obviously at worst quadratic, so an answer of "It's O(n^2)" is only interesting if accompanied by an example for quadratic or near-quadratic behaviour.)
This looks like a really neat idea, but sadly I believe the worst case behaviour is O(n^2).
Here is my attempt at a counterexample. (I'm not a mathematician so please forgive my use of Python instead of equations to express my ideas!)
Consider the string of 4K+1 symbols

s = 'a'*K + 'X' + 'a'*3*K
This will have
borders[1:]    = range(K)*2 + [K]*(2*K+1)
ne_borders[1:] = [-1]*(K-1) + [K-1] + [-1]*K + [K]*(2*K+1)
Note that:
1) ne_borders[i] will equal K for (2K+1) values of i.
2) for 0<=j<=K, borders[j]=j-1
3) the final loop in your algorithm will go into the inner loop with j==K for 2K+1 values of i
4) the inner loop will iterate K times to reduce j to 0
5) This results in the algorithm needing more than N*N/8 operations on a worst-case string of length N.
For example, for K=4 it goes round the inner loop 39 times:

s = 'aaaaXaaaaaaaaaaaa'
borders[1:]    = [0, 1, 2, 3, 0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4]
ne_borders[1:] = [-1, -1, -1, 3, -1, -1, -1, -1, 4, 4, 4, 4, 4, 4, 4, 4, 4]
For K=2,248 it goes round the inner loop 10,111,503 times!
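These counts can be reproduced with an instrumented Python port of the final loop. The sketch below builds the same borders and ne_borders tables as the C code (with an explicit i < n guard where C relies on the NUL terminator) and counts inner-loop iterations instead of accumulating the similarity:

```python
def inner_loop_count(text):
    n = len(text)
    borders = [-1] * (n + 1)
    ne_borders = [-1] * (n + 1)
    i, j = 0, -1
    while i < n:  # widest borders of prefixes, standard KMP way
        while j >= 0 and text[i] != text[j]:
            j = borders[j]
        i, j = i + 1, j + 1
        borders[i] = j
    for i in range(1, n + 1):  # widest non-extensible borders
        j = borders[i]
        if i < n and text[i] == text[j]:
            j = ne_borders[j]
        ne_borders[i] = j
    count = 0
    for i in range(n, 0, -1):  # the loop in question
        j = ne_borders[i]
        while j > 0:
            j = borders[j]
            count += 1
    return count
```

For the counterexample family this count works out to (2K+1)*K + (K-1) = 2K^2 + 2K - 1, i.e. quadratic in the string length: 39 for K=4 and 10,111,503 for K=2,248, as stated above.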
Perhaps there is a way to fix the algorithm for this case?