I have recently came up against this code lacking any comment. It finds minimal cyclic shift of word (this code specifically returns its index in string) and its called Duval algorithm. Only info I found describes algorithm in few words and has cleaner code. I would appreciate any help in understanding this algorithm. I have always found text algorithms pretty tricky and rather hard to understand.
int minLexCyc(const char *x) {
int i = 0, j = 1, k = 1, p = 1, a, b, l = strlen(x);
while(j+k <= (l<<1)) {
if ((a=x[(i+k-1)%l])>(b=x[(j+k-1)%l])) {
i=j++;
k=p=1;
} else if (a<b) {
j+=k;
k=1;
p=j-i;
} else if (a==b && k!=p) {
k++;
} else {
j+=p;
k=1;
}
}
return i;
}
First, I believe that your code has a bug in it. The last line should be
return p;
. I beleve that i holds the index of the lexicographically smallest cyclic shift, and p holds the smallest shift that matches. I also think that your stopping condition is too weak, i.e. you are doing too much checking after you have found a match, but I am not sure exactly what it should be.
Note that i and j only advance and that i is always less than j. We are looking for a string that matches the string starting at i, and we are trying to match it with a string that starts at j. We do this by comparing the k'th character of each string while increasing k (as long as they match). Note that we only change i if we determine that the string starting at j is lexicographically less than the string starting at j, and then we set i to j and reset k and p to their initial values.
I do not have time for a detailed analysis, but it looks like
Edit Going further
this section of code
if ((a=x[(i+k-1)%l])>(b=x[(j+k-1)%l])) {
i=j++;
k=p=1;
Moves the start of the comparison to a lexicographically earlier string when we find one and reinitializes everything else.
this section
} else if (a<b) {
j+=k;
k=1;
p=j-i;
is the tricky part. We have found a mismatch that is lexicographically later than our reference string, so we skip to the end of the text matched so far, and start matching from there. We also increase p (our stride). Why can we skip over all the starting points between j and j + k? This is because the string starting with i is the lexicographically smallest seen, and if the tail of the current j string is greater then the string at i then any suffix of the string at j will be greater than the string at i.
Finally
} else if (a==b && k!=p) {
k++;
} else {
j+=p;
k=1;
this just checks that the string of length p starting at i repeats.
**further edit*
We do this by incrementing k until k == p
, checking that the k'th character of the string starting at i equals the k'th character of the string starting at j. Once k reaches p we start scanning again at the next supposed occurrence of the string.
Even further edit to attempt to answer jethro's questions.
First: the k != p
in else if (a==b && k!=p)
Here we have a match in that the k'th and all previous characters in the strings starting at i and j are equal. The variable p represents the length that we think that the repeating string is. When k != p
, actually k < p
, so we are ensuring that the p characters at the string beginning at i are the same as the p characters of the string beginning at j. When k == p
(the final else) we should be at a point where the string starting at j + k
looks the same as the string starting at j, so we increase j by p and set k back to 1 and go back to comparing the two strings.
Second: Yes, I believe you are correct, it should return i. I was misunderstanding the meaning of "Minimum Cyclic Shift"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With