Algorithm to find the most repetitive (not the most common) sequence in a string (aka tandem repeats)

Question

I am looking for an algorithm (possibly implemented in Python) able to find the most REPETITIVE sequence in a string. Where for REPETITIVE, I mean any combination of chars that is repeated over and over without interruption (tandem repeat).

The algorithm I am looking for is not the same as the "find the most common word" one. In fact, the repetitive block doesn't need to be the most common word (substring) in the string.

For example:

s = 'asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs'
> f(s)
'UBAUBAUBAUBAUBA' #the "most common word" algo would return 'BA'

Unfortunately, I have no idea on how to tackle this. Any help is very welcome.

UPDATE

A little extra example to clarify that I want to be returned the sequence with the most number of repetition, whatever its basic building block is.

g = 'some noisy spacer'
s = g + 'AB'*5 + g + '_ABCDEF'*2 + g + 'AB'*3
> f(s)
'ABABABABAB' #the one with the most repetitions, not the max len

Examples from @rici:

s = 'aaabcabc'
> f(s)
'abcabc'

s = 'ababcababc'
> f(s)
'ababcababc' #'abab' would also be a solution here
             # since it is repeated 2 times in a row as 'ababcababc'.
             # The proper algorithm would return both solutions.

Serhii Kostel · Accepted Answer

Here is the solution based on ((\w+?)\2+) regex but with additional improvements:

import re
from itertools import chain


def repetitive(sequence, rep_min_len=1):
    """Find the most repetitive sequence in a string.

    :param str sequence: string for search
    :param int rep_min_len: minimal length of repetitive substring
    :return the most repetitive substring or None
    """
    greedy, non_greedy = re.compile(r'((\w+)\2+)'), re.compile(r'((\w+?)\2+)')

    all_rep_seach = lambda regex: \
        (regex.search(sequence[shift:]) for shift in range(len(sequence)))

    searched = list(
        res.groups()
        for res in chain(all_rep_seach(greedy), all_rep_seach(non_greedy))
        if res)

    if not sequence:
        return None

    cmp_key = lambda res: res[0].count(res[1]) if len(res[1]) >= rep_min_len else 0
    return max(searched, key=cmp_key)[0]

You can test it like so:

def check(seq, expected, rep_min_len=1):
    result = repetitive(seq, rep_min_len)
    print('%s => %s' % (seq, result))
    assert result == expected, expected


check('asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs', 'UBAUBAUBAUBAUBA')
check('some noisy spacerABABABABABsome noisy spacer_ABCDEF_ABCDEFsome noisy spacerABABAB', 'ABABABABAB')
check('aaabcabc', 'aaa')
check('aaabcabc', 'abcabc', rep_min_len=2)
check('ababcababc', 'ababcababc')
check('ababcababcababc', 'ababcababcababc')

Key features:

used greedy ((\w+)\2+) and non-greedy ((\w+)\2+?) regex;
search repetitive substring in all substrings with the shift from the beginning (e.g.'string' => ['string', 'tring', 'ring', 'ing', 'ng', 'g']);
selection is based on the number of repetitions not on the length of subsequence (e.g. for 'ABABABAB_ABCDEF_ABCDEF' result will be 'ABABABAB', not '_ABCDEF_ABCDEF');
the minimum length of a repeating sequence is matters (see 'aaabcabc' check).

RomanPerekhrest · Answer

With combination of re.findall() (using specific regex patten) and max() functions:

import re

#  extended sample string
s = 'asdfewfUBAUBAUBAUBAUBAasdkjnfencsADADADAD sometext'

def find_longest_rep(s):
    result = max(re.findall(r'((\w+?)\2+)', s), key=lambda t: len(t[0]))
    return result[0]

print(find_longest_rep(s))

The output:

UBAUBAUBAUBAUBA

The crucial pattern:

((\w+?)\2+):
- (....) - the outermost captured group which is the 1st captured group
- (\w+?) - any non-whitespace character sequence enclosed into the 2nd captured group; +? - quantifier, matches between one and unlimited times, as few times as possible, expanding as needed
- \2+ - matches the same text as most recently matched by the 2nd capturing group

M. Jajeh · Answer

What you are searching for is an algorithm to find the 'largest' primitive tandem repeat in a string. Here is a paper describing a linear time algorithm to find all tandem repeats in a string and by extension all primitive tandem repeats. Gusfield. Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String

Algorithm to find the most repetitive (not the most common) sequence in a string (aka tandem repeats)

Tags:

python

string

regex

algorithm

python-3.x

alec_djinn

3 Answers

Serhii Kostel

RomanPerekhrest

M. Jajeh

Recent Activity

Donate For Us

Algorithm to find the most repetitive (not the most common) sequence in a string (aka tandem repeats)

Tags:

python

string

regex

algorithm

python-3.x

alec_djinn

3 Answers

Serhii Kostel

RomanPerekhrest

M. Jajeh

Related questions

Recent Activity

Donate For Us