Counting consecutive alphabets and hyphens and encode them as run length

Question

How do I encode my hyphenated fasta format string to group all consecutive Nucleotide and hyphens and encode them as run length.

Consider my sequence as "ATGC----CGCTA-----G---". The string has sequence of Nucleotide followed by sequence of hyphens. I am trying to group all consecutive Nucleotide as the letter M and consecutive hyphens as letter D and prefix it with the size of the sub sequence.

The final result out of this encoding should be 4M4D5M5D1M3D.

The following pictorial graphic explains it further

ATGC----CGCTA-----G---
 |   |    |    |  |  |
 V   V    V    V  V  V
4M   4D  5M    5D 1M 3D

When I use Counter or list.count(), I get "M":10 "D":12:

from collections import Counter

seq="ATGC----CGCTA-----G---"

M=0
D=0   

cigar=[]

for char in seq:    
    if char.isalpha():
        M+=1
        cigar.append("M")   
    else:
        D+=1
        cigar.append("D")

print Counter(cigar)

Abhijit · Accepted Answer

This problem is ideal for itertools.groupby

Implementation

from itertools import groupby
''.join('{}{}'.format(len(list(g)), 'DM'[k]) 
        for k, g in groupby(seq, key = str.isalpha))

Output '4M4D5M5D1M3D'

Explanation

Notably, the key function is crucial here. Group the sequence based on it being an alphabet or not. Once done, it should be straight forward to count the size of each group and figure out the type of the group from the key element.

Some explanation of code

'DM'[k]: This is just a nifty way of representing "M" if k == True else "D"
len(list(g)): Determines the size of each group. Alternatively, it could have been written as sum(1 for e in g)
'{}{}'.format: String formatting to create a concatenation of the consecutive frequency and the type
''.join(: To join the list elements as a string sequence.

jeff_carter · Answer

import re
seq='ATGC----CGCTA-----G---'

output = ''
for section in re.split('(-*)', seq):
    if section.isalpha():
        output += str(len(section)) + 'M'
    elif section !='':
        output += str(len(section)) + 'D'
print output

B. M. · Answer

classical approach :

seq="ATGC----CGCTA-----G---"

def MD(c):
    if c.isalpha():return "M"
    else : return "D"

count=1
string=""
for i in range(len(seq)-1):
    if MD(seq[i])==MD(seq[i+1]): count+=1
    else: 
        string=string+str(count)+MD(seq[i])
        count=1
string=string+str(count)+MD(seq[-1])
print string

Counting consecutive alphabets and hyphens and encode them as run length

Tags:

python

collections

python-2.7

counter

gthm

3 Answers

Abhijit

jeff_carter

B. M.

Recent Activity

Donate For Us

Counting consecutive alphabets and hyphens and encode them as run length

Tags:

python

collections

python-2.7

counter

gthm

3 Answers

Abhijit

jeff_carter

B. M.

Related questions

Recent Activity

Donate For Us