Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

A more complex version of "How can I tell if a string repeats itself in Python?"

I was reading this post and I wonder if someone can find the way to catch repetitive motifs into a more complex string.

For example, find all the repetitive motifs in

string = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

Here the repetitive motifs: 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

So, the output should be something like this:

output = {'ACGT': {'repeat': 2,
                   'region': (5,13)},
          'GT': {'repeat': 3,
                 'region': (19,24)},
          'TATACG': {'repeat': 2,
                     'region': (29,40)}}

This example comes from a typical biological phenomena termed microsatellite which is present into the DNA.

UPDATE 1: Asterisks were removed from the string variable. It was a mistake.

UPDATE 2: Single character motif doesn't count. For example: in ACGUGAAAGUC, the 'A' motif is not taken into account.

like image 801
Ivan Castro Avatar asked Apr 29 '15 21:04

Ivan Castro


1 Answers

you can use a recursion function as following :

Note: The result argument will be treated as a global variable (because passing mutable object to the function affects the caller)

import re
def finder(st,past_ind=0,result=[]):
   m=re.search(r'(.+)\1+',st)
   if m:
      i,j=m.span()
      sub=st[i:j]
      ind = (sub+sub).find(sub, 1)
      sub=sub[:ind]
      if len(sub)>1:
        result.append([sub,(i+past_ind+1,j+past_ind+1)])
      past_ind+=j
      return finder(st[j:],past_ind)
   else:
      return result



s='AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print finder(s)

result:

[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]

answer to previous question for following string :

s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'

You can use both answers from mentioned question and some extra recipes :

First you can split the string with ** then create a new list contain the repeated strings with r'(.+)\1+' regex :

So the result will be :

>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']

Note about 'ACGTACGT' that missed the A at the end!

Then you can use principal_period's function to get the repeated sub strings :

def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]

>>> for i in new:
...    p=principal_period(i)
...    if p is not None and len(p)>1:
...        l.append(p)
...        sub.append(i)
... 

So you will have the repeated strings in l and main strings in sub :

>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']

Then you need a the region that you can do it with span method :

>>> for t in sub:
...    regons.append(re.search(t,s).span())

>>> regons
[(6, 14), (24, 30), (38, 50)]

And at last you can zip the 3 list regon,sub,l and use a dict comprehension to create the expected result :

>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}

The main code :

>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regon=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
...    p=principal_period(i)
...    if p is not None and len(p)>1:
...        l.append(p)
...        sub.append(i)
... 

>>> for t in sub:
...    regons.append(re.search(t,s).span())
... 
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
like image 113
Mazdak Avatar answered Oct 15 '22 19:10

Mazdak