Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find the period of a string

I take a input from the user and its a string with a certain substring which repeats itself all through the string. I need to output the substring or its length AKA period.

Say

S1 = AAAA // substring is A
S2 = ABAB // Substring is AB
S3 = ABCAB // Substring is ABC
S4 = EFIEFI // Substring is EFI

I could start with a Single char and check if it is same as its next character if it is not, I could do it with two characters then with three and so on. This would be a O(N^2) algo. I was wondering if there is a more elegant solution to this.

like image 890
Aditya Avatar asked Jan 16 '14 17:01

Aditya


2 Answers

You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).

like image 150
tmyklebu Avatar answered Oct 10 '22 19:10

tmyklebu


Let me assume that the length of the string n is at least twice greater than the period p.

Algorithm

  1. Let m = 1, and S the whole string
  2. Take m = m*2
    • Find the next occurrence of the substring S[:m]
    • Let k be the start of the next occurrence
    • Check if S[:k] is the period
    • if not go to 2.

Example

Suppose we have a string

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC

For each power m of 2 we find repetitions of first 2^m characters. Then we extend this sequence to it's second occurrence. Let's start with 2^1 so CD.

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD   CDCD   CDCD   CDCD   CD

We don't extend CD since the next occurrence is just after that. However CD is not the substring we are looking for so let's take the next power: 2^2 = 4 and substring CDCD.

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD   CDCD

Now let's extend our string to the first repetition. We get

CDCDFBF

we check if this is periodic. It is not so we go further. We try 2^3 = 8, so CDCDFBFC

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC      CDCDFBFC      

we try to extend and we get

CDCDFBFCDCDFDF

and this indeed is our period.

I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.

Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.

A very nice problem though.

like image 39
Łukasz Kidziński Avatar answered Oct 10 '22 21:10

Łukasz Kidziński