Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Detect whether sequence is a multiple of a subsequence in Python

I have a tuple of zeros and ones, for instance:

(1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)

It turns out:

(1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1) == (1, 0, 1, 1) * 3

I want a function f such that if s is a non-empty tuple of zeros and ones, f(s) is the shortest subtuple r such that s == r * n for some positive integer n.

So for instance,

f( (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1) ) == (1, 0, 1, 1)

What is a slick way to write the function f in Python?

Edit:

The naive method I am currently using

def f(s):
  for i in range(1,len(s)):
    if len(s)%i == 0 and s == s[:i] * (len(s)/i):
      return s[:i]
like image 973
math4tots Avatar asked Mar 11 '13 06:03

math4tots


1 Answers

I believe I have an O(n) time solution (actually 2n+r, n is length of tuple, r is sub tuplle) which does not use suffix trees, but uses a string matching algorithm (like KMP, which you should find off-the shelf).

We use the following little known theorem:

If x,y are strings over some alphabet,

then xy = yx if and only if x = z^k and y = z^l for some string z and integers k,l.

I now claim that, for the purposes of our problem, this means that all we need to do is determine if the given tuple/list (or string) is a cyclic shift of itself!

To determine if a string is a cyclic shift of itself, we concatenate it with itself (it does not even have to be a real concat, just a virtual one will do) and check for a substring match (with itself).

For a proof of that, suppose the string is a cyclic shift of itself.

The we have that the given string y = uv = vu. Since uv = vu, we must have that u = z^k and v= z^l and hence y = z^{k+l} from the above theorem. The other direction is easy to prove.

Here is the python code. The method is called powercheck.

def powercheck(lst):
    count = 0
    position = 0
    for pos in KnuthMorrisPratt(double(lst), lst):
        count += 1
        position = pos
        if count == 2:
            break

    return lst[:position]


def double(lst):
    for i in range(1,3):
        for elem in lst:
            yield elem

def main():
    print powercheck([1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1])

if __name__ == "__main__":
    main()

And here is the KMP code which I used (due to David Eppstein).

# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

For your sample this outputs

[1,0,1,1]

as expected.

I compared this against shx2's code(not the numpy one), by generating a random 50 bit string, then replication to make the total length as 1 million. This was the output (the decimal number is the output of time.time())

1362988461.75

(50, [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1])

1362988465.96

50 [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1]

1362988487.14

The above method took ~4 seconds, while shx2's method took ~21 seconds!

Here was the timing code. (shx2's method was called powercheck2).

def rand_bitstring(n):
    rand = random.SystemRandom()
    lst = []
    for j in range(1, n+1):
        r = rand.randint(1,2)
        if r == 2:
            lst.append(0)
        else:
            lst.append(1)

    return lst

def main():
    lst = rand_bitstring(50)*200000
    print time.time()
    print powercheck(lst)
    print time.time()
    powercheck2(lst)
    print time.time()
like image 177
Knoothe Avatar answered Sep 30 '22 14:09

Knoothe