Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Detect period of unknown source

How to detect repeating digits in an infinite sequence? I tried Floyd & Brent detection algorithm but come to nothing... I have a generator that yields numbers ranging from 0 to 9 (inclusive) and I have to recognize a period in it.

Example test case:

import itertools

# of course this is a fake one just to offer an example
def source():
    return itertools.cycle((1, 0, 1, 4, 8, 2, 1, 3, 3, 1))

>>> gen = source()
>>> period(gen)
(1, 0, 1, 4, 8, 2, 1, 3, 3, 1)
like image 780
rubik Avatar asked Jun 26 '12 10:06

rubik


4 Answers

Empirical methods

Here's a fun take on the problem. The more general form of your question is this:

Given a repeating sequence of unknown length, determine the period of the signal.

The process to determine the repeating frequencies is known as the Fourier Transform. In your example case the signal is clean and discrete, but the following solution will work even with continuous noisy data! The FFT will try to duplicate the frequencies of the input signal by approximating them in the so-called "wave-space" or "Fourier-space". Basically a peak in this space corresponds to a repeating signal. The period of your signal is related to the longest wavelength that is peaked.

import itertools

# of course this is a fake one just to offer an example
def source():
    return itertools.cycle((1, 0, 1, 4, 8, 2, 1, 3, 3, 2))

import pylab as plt
import numpy as np
import scipy as sp

# Generate some test data, i.e. our "observations" of the signal
N = 300
vals = source()
X = np.array([vals.next() for _ in xrange(N)])

# Compute the FFT
W    = np.fft.fft(X)
freq = np.fft.fftfreq(N,1)
 
# Look for the longest signal that is "loud"
threshold = 10**2
idx = np.where(abs(W)>threshold)[0][-1]

max_f = abs(freq[idx])
print "Period estimate: ", 1/max_f

This gives the correct answer for this case, 10 though if N didn't divide the cycles cleanly, you would get a close estimate. We can visualize this via:

plt.subplot(211)
plt.scatter([max_f,], [np.abs(W[idx]),], s=100,color='r')
plt.plot(freq[:N/2], abs(W[:N/2]))
plt.xlabel(r"$f$")

plt.subplot(212)
plt.plot(1.0/freq[:N/2], abs(W[:N/2]))
plt.scatter([1/max_f,], [np.abs(W[idx]),], s=100,color='r')
plt.xlabel(r"$1/f$")
plt.xlim(0,20)

plt.show()

enter image description here

like image 57
Hooked Avatar answered Nov 07 '22 11:11

Hooked


Evgeny Kluev's answer provides a way to get an answer that might be right.

Definition

Let's assume you have some sequence D that is a repeating sequence. That is there is some sequence d of length L such that: D_i = d_{i mod L}, where t_i is the ith element of sequence t that is numbered from 0. We will say sequence d generates D.

Theorem

Given a sequence D which you know is generated by some finite sequence t. Given some d it is impossible to decide in finite time whether it generates D.

Proof

Since we are only allowed a finite time we can only access a finite number of elements of D. Let us suppose we access the first F elements of D. We chose the first F because if we are only allowed to access a finite number, the set containing the indices of the elements we've accessed is finite and hence has a maximum. Let that maximum be M. We can then let F = M+1, which is still a finite number.

Let L be the length of d and that D_i = d_{i mod L} for i < F. There are two possibilities for D_F it is either the same as d_{F mod L} or it is not. In the former case d seems to work, but in the latter case it does not. We cannot know which case is true until we access D_F. This will however require accessing F+1 elements, but we are limited to F element accesses.

Hence, for any F we won't have enough information to decide whether d generates D and therefore it is impossible to know in finite time whether d generates D.

Conclusions

It is possible to know in finite time that a sequence d does not generate D, but this doesn't help you. Since you want to find a sequence d that does generate D, but this involves amongst other things being able to prove that some sequence generates D.

Unless you have more information about D this problem is unsolvable. One bit of information that will make this problem decidable is some upper bound on the length of the shortest d that generates D. If you know the function generating D only has a known amount of finite state you can calculate this upper bound.

Hence, my conclusion is that you cannot solve this problem unless you change the specification a bit.

like image 34
JPvdMerwe Avatar answered Nov 07 '22 13:11

JPvdMerwe


I have no idea about proper algorithms to apply here, but my understanding also is that you can never know for sure that you've detected a period if you have consumed only a finite number of terms. Anyway, here's what I've come up with, this is a very naive implementation, more to educate from the comments than to provide a good solution (I guess).

def guess_period(source, minlen=1, maxlen=100, trials=100):
    for n in range(minlen, maxlen+1):
        p = [j for i, j in zip(range(n), source)]
        if all([j for i, j in zip(range(n), source)] == p
               for k in range(trials)):
            return tuple(p)
    return None

This one, however, "forgets" the initial order and returns a tuple that is a cyclic permutation of the actual period:

In [101]: guess_period(gen)
Out[101]: (0, 1, 4, 8, 2, 1, 3, 3, 1, 1)

To compensate for this, you'll need to keep track of the offset.

like image 22
Lev Levitsky Avatar answered Nov 07 '22 12:11

Lev Levitsky


Since your sequence is not of the form Xn+1 = f(Xn), Floyd's or Brent's algorithms are not directly applicable to your case. But they may be extended to do the task:

  1. Use Floyd's or Brent's algorithm to find some repeating element of the sequence.
  2. Find next sequence element with the same value. Distance between these elements is a supposed period (k).
  3. Remember next k elements of the sequence
  4. Find the next occurrence of this k-element subsequence.
  5. If distance between subsequences is greater than k, update k and continue with the step 3.
  6. Repeat step 4 several times to verify the result. If maximum length of the repeating sub-sequence is known a-priori, use appropriate number of repetitions. Otherwise use as much repetitions as possible, because each repetition increases the result's correctness.

If the sequence cycling starts from the first element, ignore step 1 and start from step 2 (find next sequence element equal to the first element).

If the sequence cycling does not start from the first element, it is possible that Floyd's or Brent's algorithm finds some repeating element of the sequence that does not belong to the cycle. So it is reasonable to limit the number of iterations in steps 2 and 4, and if this limit is exceeded, continue with the step 1.

like image 41
Evgeny Kluev Avatar answered Nov 07 '22 11:11

Evgeny Kluev