Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Counting consecutive alphabets and hyphens and encode them as run length

How do I encode my hyphenated fasta format string to group all consecutive Nucleotide and hyphens and encode them as run length.

Consider my sequence as "ATGC----CGCTA-----G---". The string has sequence of Nucleotide followed by sequence of hyphens. I am trying to group all consecutive Nucleotide as the letter M and consecutive hyphens as letter D and prefix it with the size of the sub sequence.

The final result out of this encoding should be 4M4D5M5D1M3D.

The following pictorial graphic explains it further

ATGC----CGCTA-----G---
 |   |    |    |  |  |
 V   V    V    V  V  V
4M   4D  5M    5D 1M 3D

When I use Counter or list.count(), I get "M":10 "D":12:

from collections import Counter

seq="ATGC----CGCTA-----G---"

M=0
D=0   

cigar=[]

for char in seq:    
    if char.isalpha():
        M+=1
        cigar.append("M")   
    else:
        D+=1
        cigar.append("D")

print Counter(cigar)
like image 376
gthm Avatar asked Apr 25 '15 18:04

gthm


3 Answers

This problem is ideal for itertools.groupby

Implementation

from itertools import groupby
''.join('{}{}'.format(len(list(g)), 'DM'[k]) 
        for k, g in groupby(seq, key = str.isalpha))

Output '4M4D5M5D1M3D'

Explanation

Notably, the key function is crucial here. Group the sequence based on it being an alphabet or not. Once done, it should be straight forward to count the size of each group and figure out the type of the group from the key element.

Some explanation of code

  • 'DM'[k]: This is just a nifty way of representing "M" if k == True else "D"
  • len(list(g)): Determines the size of each group. Alternatively, it could have been written as sum(1 for e in g)
  • '{}{}'.format: String formatting to create a concatenation of the consecutive frequency and the type
  • ''.join(: To join the list elements as a string sequence.
like image 128
Abhijit Avatar answered Nov 15 '22 21:11

Abhijit


import re
seq='ATGC----CGCTA-----G---'

output = ''
for section in re.split('(-*)', seq):
    if section.isalpha():
        output += str(len(section)) + 'M'
    elif section !='':
        output += str(len(section)) + 'D'
print output
like image 32
jeff_carter Avatar answered Nov 15 '22 21:11

jeff_carter


classical approach :

seq="ATGC----CGCTA-----G---"

def MD(c):
    if c.isalpha():return "M"
    else : return "D"

count=1
string=""
for i in range(len(seq)-1):
    if MD(seq[i])==MD(seq[i+1]): count+=1
    else: 
        string=string+str(count)+MD(seq[i])
        count=1
string=string+str(count)+MD(seq[-1])
print string
like image 41
B. M. Avatar answered Nov 15 '22 21:11

B. M.