Inspired by two questions, "String manipulation: calculate the similarity of a string with its suffixes" and "Program execution varies as the I/P size increases beyond 5 in C", I came up with the algorithm below.
A bit of context first. For two strings, define their similarity as the length of their longest common prefix. The total self-similarity of a string s is the sum of the similarities of s with all of its suffixes. So, for example, the total self-similarity of abacab is 6 + 0 + 1 + 0 + 2 + 0 = 9, and the total self-similarity of a single character repeated n times is n*(n+1)/2.
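As a sanity check, the definition can be computed directly by brute force. A short Python sketch (helper names are mine, not from the code below):

```python
def lcp_len(a, b):
    # Length of the longest common prefix of a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def total_self_similarity(s):
    # Sum of the similarities of s with each of its suffixes.
    return sum(lcp_len(s, s[i:]) for i in range(len(s)))
```

This is O(n^2) in the worst case, but it reproduces the two examples above: 9 for abacab and n*(n+1)/2 for a run of identical characters.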
Description of the algorithm: The algorithm is based on the Knuth-Morris-Pratt string searching algorithm, in that the borders of the string's prefixes play the central role.
To recapitulate: a border of a string s is a proper substring b of s which is simultaneously a prefix and a suffix of s.
Remark: If b and c are borders of s with b shorter than c, then b is also a border of c, and conversely, every border of c is also a border of s.
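The remark is easy to check with a brute-force enumeration of borders (a sketch for illustration only, not part of the algorithm):

```python
def all_borders(s):
    # Every proper prefix of s that is also a suffix, shortest first.
    # (The empty string is a border of every non-empty string.)
    return [s[:k] for k in range(len(s)) if s[:k] == s[len(s) - k:]]
```

For example, the borders of abacaba are '', 'a' and 'aba', and every border of the border 'aba' is again a border of abacaba, as the remark states.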
Let s be a string of length n and let p be a prefix of s of length i. We call a border b of width k of p non-extensible if either i == n or s[i] != s[k]; otherwise it is extensible (the length-(k+1) prefix of s is then a border of the length-(i+1) prefix of s).
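In this terminology, the extensibility test is a one-liner. A minimal sketch (hypothetical helper names, 0-based indices as in the C code):

```python
def is_border(p, b):
    # b is a border of p: a proper prefix that is also a suffix.
    return len(b) < len(p) and p.startswith(b) and p.endswith(b)

def is_extensible(s, i, k):
    # A width-k border of the length-i prefix of s is extensible
    # iff we are not at the end of s and the next characters match.
    return i < len(s) and s[i] == s[k]
```

For s = abacab, the width-1 border of the length-5 prefix abaca is extensible (s[5] == s[1] == 'b'), while the width-1 border of the length-3 prefix aba is not (s[3] == 'c' != s[1] == 'b').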
Now, if the longest common prefix of s and the suffix starting at s[i], i > 0, has length k, then the length-k prefix of s is a non-extensible border of the length-(i+k) prefix of s. It is a border because it is a common prefix of s and s[i .. n-1], and if it were extensible, it would not be the longest common prefix. Conversely, every non-extensible border (of length k) of the length-i prefix of s is the longest common prefix of s and the suffix starting at s[i-k].
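This correspondence can be cross-checked by brute force: summing the widths of all non-empty non-extensible borders over every prefix, plus n for the similarity of s with itself, should reproduce the total self-similarity. A quadratic Python sketch of that check:

```python
def ne_border_sum(s):
    # n for the similarity of s with itself, plus the width of every
    # non-empty non-extensible border of every prefix of s.
    n = len(s)
    total = n
    for i in range(1, n + 1):
        p = s[:i]
        for k in range(1, i):
            if p[:k] == p[i - k:]:              # width-k border of the i-prefix
                if i == n or s[i] != s[k]:      # non-extensible
                    total += k
    return total
```

On abacab this gives 6 + 1 + 2 = 9, matching the direct suffix-by-suffix computation.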
So we can calculate the total self-similarity of s by adding n (the similarity of s with itself) and the lengths of all non-extensible borders of the length-i prefixes of s, 1 <= i <= n. To do that, for each 1 <= i <= n: if p = s[0 .. i-1] has a non-empty non-extensible border, let b be the widest of these; add the width of b, and for every non-empty border c of b, if c is also a non-extensible border of p, add its length too.

Code (C):
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/*
 * Overflow and NULL checks omitted to not clutter the algorithm.
 */

int similarity(char *text){
    int *borders, *ne_borders, len = strlen(text), i, j, sim;
    borders = malloc((len+1)*sizeof(*borders));
    ne_borders = malloc((len+1)*sizeof(*ne_borders));
    i = 0;
    j = -1;
    borders[i] = j;
    ne_borders[i] = j;
    /*
     * Find the length of the widest borders of prefixes of text,
     * standard KMP way, O(len).
     */
    while(i < len){
        while(j >= 0 && text[i] != text[j]){
            j = borders[j];
        }
        ++i, ++j;
        borders[i] = j;
    }
    /*
     * For each prefix, find the length of its widest non-extensible
     * border, this part is also O(len).
     */
    for(i = 1; i <= len; ++i){
        j = borders[i];
        /*
         * If the widest border of the i-prefix has width j and is
         * extensible (text[i] == text[j]), the widest non-extensible
         * border of the i-prefix is the widest non-extensible border
         * of the j-prefix.
         */
        if (text[i] == text[j]){
            j = ne_borders[j];
        }
        ne_borders[i] = j;
    }
    /* The longest common prefix of text and text is text. */
    sim = len;
    for(i = len; i > 0; --i){
        /*
         * If a longest common prefix of text and one of its suffixes
         * ends right before text[i], it is a non-extensible border of
         * the i-prefix of text, and conversely, every non-extensible
         * border of the i-prefix is a longest common prefix of text
         * and one of its suffixes.
         *
         * So, if the i-prefix has any non-extensible border, we must
         * sum the lengths of all these. Starting from the widest
         * non-extensible border, we must check all of its non-empty
         * borders for extensibility.
         *
         * Can this introduce nonlinearity? How many extensible borders
         * shorter than the widest non-extensible border can a prefix have?
         */
        if ((j = ne_borders[i]) > 0){
            sim += j;
            while(j > 0){
                j = borders[j];
                if (text[i] != text[j]){
                    sim += j;
                }
            }
        }
    }
    free(borders);
    free(ne_borders);
    return sim;
}

/* The naive algorithm for comparison */
int common_prefix(char *text, char *suffix){
    int c = 0;
    while(*suffix && *suffix++ == *text++) ++c;
    return c;
}

int naive_similarity(char *text){
    int len = (int)strlen(text);
    int i, sim = 0;
    for(i = 0; i < len; ++i){
        sim += common_prefix(text,text+i);
    }
    return sim;
}

int main(int argc, char *argv[]){
    int i;
    for(i = 1; i < argc; ++i){
        printf("%d\n",similarity(argv[i]));
    }
    for(i = 1; i < argc; ++i){
        printf("%d\n",naive_similarity(argv[i]));
    }
    return EXIT_SUCCESS;
}
So, is this correct? I'd be rather surprised if not, but I've been wrong before.
What is the worst case complexity of the algorithm?
I think it's O(n), but I haven't yet found a proof that the number of extensible borders contained in a prefix's widest non-extensible border is bounded (or rather, that the total number of such occurrences over all prefixes is O(n)).
I'm most interested in sharp bounds, but if you can prove that it's e.g. O(n*log n) or O(n^(1+x)) for small x
, that's already good. (It's obviously at worst quadratic, so an answer of "It's O(n^2)" is only interesting if accompanied by an example for quadratic or near-quadratic behaviour.)
This looks like a really neat idea, but sadly I believe the worst case behaviour is O(n^2).
Here is my attempt at a counterexample. (I'm not a mathematician so please forgive my use of Python instead of equations to express my ideas!)
Consider the string of 4K+1 symbols

s = 'a'*K + 'X' + 'a'*3*K
This will have
borders[1:]    = range(K)*2 + [K]*(2*K+1)
ne_borders[1:] = [-1]*(K-1) + [K-1] + [-1]*K + [K]*(2*K+1)
Note that:
1) ne_borders[i] will equal K for (2K+1) values of i.
2) for 0<=j<=K, borders[j]=j-1
3) the final loop in your algorithm will go into the inner loop with j==K for 2K+1 values of i
4) the inner loop will iterate K times to reduce j to 0
5) This results in the algorithm needing more than N*N/8 operations on a worst-case string of length N.
For example, for K=4 it goes round the inner loop 39 times:

s = 'aaaaXaaaaaaaaaaaa'
borders[1:]    = [0, 1, 2, 3, 0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4]
ne_borders[1:] = [-1, -1, -1, 3, -1, -1, -1, -1, 4, 4, 4, 4, 4, 4, 4, 4, 4]
For K=2,248 it goes round the inner loop 10,111,503 times!
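These counts can be reproduced with an instrumented Python port of the final loop. The sketch below builds the same borders and ne_borders tables as the C code (with an explicit i < n guard where C relies on the NUL terminator) and counts inner-loop iterations instead of accumulating the similarity:

```python
def inner_loop_count(text):
    n = len(text)
    borders = [-1] * (n + 1)
    ne_borders = [-1] * (n + 1)
    i, j = 0, -1
    while i < n:  # widest borders of prefixes, standard KMP way
        while j >= 0 and text[i] != text[j]:
            j = borders[j]
        i, j = i + 1, j + 1
        borders[i] = j
    for i in range(1, n + 1):  # widest non-extensible borders
        j = borders[i]
        if i < n and text[i] == text[j]:
            j = ne_borders[j]
        ne_borders[i] = j
    count = 0
    for i in range(n, 0, -1):  # the loop in question
        j = ne_borders[i]
        while j > 0:
            j = borders[j]
            count += 1
    return count
```

For the counterexample family this count works out to (2K+1)*K + (K-1) = 2K^2 + 2K - 1, i.e. quadratic in the string length: 39 for K=4 and 10,111,503 for K=2,248, as stated above.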
Perhaps there is a way to fix the algorithm for this case?