Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to find the most repetitive (not the most common) sequence in a string (aka tandem repeats)

I am looking for an algorithm (possibly implemented in Python) able to find the most REPETITIVE sequence in a string. Where for REPETITIVE, I mean any combination of chars that is repeated over and over without interruption (tandem repeat).

The algorithm I am looking for is not the same as the "find the most common word" one. In fact, the repetitive block doesn't need to be the most common word (substring) in the string.

For example:

s = 'asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs'
> f(s)
'UBAUBAUBAUBAUBA' #the "most common word" algo would return 'BA'

Unfortunately, I have no idea on how to tackle this. Any help is very welcome.


UPDATE

A little extra example to clarify that I want to be returned the sequence with the most number of repetition, whatever its basic building block is.

g = 'some noisy spacer'
s = g + 'AB'*5 + g + '_ABCDEF'*2 + g + 'AB'*3
> f(s)
'ABABABABAB' #the one with the most repetitions, not the max len

Examples from @rici:

s = 'aaabcabc'
> f(s)
'abcabc'

s = 'ababcababc'
> f(s)
'ababcababc' #'abab' would also be a solution here
             # since it is repeated 2 times in a row as 'ababcababc'.
             # The proper algorithm would return both solutions.
like image 295
alec_djinn Avatar asked Feb 19 '18 16:02

alec_djinn


3 Answers

Here is the solution based on ((\w+?)\2+) regex but with additional improvements:

import re
from itertools import chain


def repetitive(sequence, rep_min_len=1):
    """Find the most repetitive sequence in a string.

    :param str sequence: string for search
    :param int rep_min_len: minimal length of repetitive substring
    :return the most repetitive substring or None
    """
    greedy, non_greedy = re.compile(r'((\w+)\2+)'), re.compile(r'((\w+?)\2+)')

    all_rep_seach = lambda regex: \
        (regex.search(sequence[shift:]) for shift in range(len(sequence)))

    searched = list(
        res.groups()
        for res in chain(all_rep_seach(greedy), all_rep_seach(non_greedy))
        if res)

    if not sequence:
        return None

    cmp_key = lambda res: res[0].count(res[1]) if len(res[1]) >= rep_min_len else 0
    return max(searched, key=cmp_key)[0]

You can test it like so:

def check(seq, expected, rep_min_len=1):
    result = repetitive(seq, rep_min_len)
    print('%s => %s' % (seq, result))
    assert result == expected, expected


check('asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs', 'UBAUBAUBAUBAUBA')
check('some noisy spacerABABABABABsome noisy spacer_ABCDEF_ABCDEFsome noisy spacerABABAB', 'ABABABABAB')
check('aaabcabc', 'aaa')
check('aaabcabc', 'abcabc', rep_min_len=2)
check('ababcababc', 'ababcababc')
check('ababcababcababc', 'ababcababcababc')

Key features:

  1. used greedy ((\w+)\2+) and non-greedy ((\w+)\2+?) regex;
  2. search repetitive substring in all substrings with the shift from the beginning (e.g.'string' => ['string', 'tring', 'ring', 'ing', 'ng', 'g']);
  3. selection is based on the number of repetitions not on the length of subsequence (e.g. for 'ABABABAB_ABCDEF_ABCDEF' result will be 'ABABABAB', not '_ABCDEF_ABCDEF');
  4. the minimum length of a repeating sequence is matters (see 'aaabcabc' check).
like image 181
Serhii Kostel Avatar answered Oct 22 '22 11:10

Serhii Kostel


With combination of re.findall() (using specific regex patten) and max() functions:

import re

#  extended sample string
s = 'asdfewfUBAUBAUBAUBAUBAasdkjnfencsADADADAD sometext'

def find_longest_rep(s):
    result = max(re.findall(r'((\w+?)\2+)', s), key=lambda t: len(t[0]))
    return result[0]

print(find_longest_rep(s))

The output:

UBAUBAUBAUBAUBA

The crucial pattern:

  • ((\w+?)\2+):
    • (....) - the outermost captured group which is the 1st captured group
    • (\w+?) - any non-whitespace character sequence enclosed into the 2nd captured group; +? - quantifier, matches between one and unlimited times, as few times as possible, expanding as needed
    • \2+ - matches the same text as most recently matched by the 2nd capturing group
like image 13
RomanPerekhrest Avatar answered Oct 22 '22 10:10

RomanPerekhrest


What you are searching for is an algorithm to find the 'largest' primitive tandem repeat in a string. Here is a paper describing a linear time algorithm to find all tandem repeats in a string and by extension all primitive tandem repeats. Gusfield. Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String

like image 4
M. Jajeh Avatar answered Oct 22 '22 10:10

M. Jajeh