How do I encode my hyphenated fasta format string to group all consecutive Nucleotide and hyphens and encode them as run length.
Consider my sequence as "ATGC----CGCTA-----G---". The string has sequence of Nucleotide followed by sequence of hyphens. I am trying to group all consecutive Nucleotide as the letter M
and consecutive hyphens as letter D
and prefix it with the size of the sub sequence.
The final result out of this encoding should be 4M4D5M5D1M3D
.
The following pictorial graphic explains it further
ATGC----CGCTA-----G---
| | | | | |
V V V V V V
4M 4D 5M 5D 1M 3D
When I use Counter
or list.count()
, I get "M":10 "D":12
:
from collections import Counter
seq="ATGC----CGCTA-----G---"
M=0
D=0
cigar=[]
for char in seq:
if char.isalpha():
M+=1
cigar.append("M")
else:
D+=1
cigar.append("D")
print Counter(cigar)
This problem is ideal for itertools.groupby
Implementation
from itertools import groupby
''.join('{}{}'.format(len(list(g)), 'DM'[k])
for k, g in groupby(seq, key = str.isalpha))
Output '4M4D5M5D1M3D'
Explanation
Notably, the key function is crucial here. Group the sequence based on it being an alphabet or not. Once done, it should be straight forward to count the size of each group and figure out the type of the group from the key element.
Some explanation of code
'DM'[k]
: This is just a nifty way of representing "M" if k == True else "D"
len(list(g))
: Determines the size of each group. Alternatively, it could have been written as sum(1 for e in g)
'{}{}'.format
: String formatting to create a concatenation of the consecutive frequency and the type''.join(
: To join the list elements as a string sequence.import re
seq='ATGC----CGCTA-----G---'
output = ''
for section in re.split('(-*)', seq):
if section.isalpha():
output += str(len(section)) + 'M'
elif section !='':
output += str(len(section)) + 'D'
print output
classical approach :
seq="ATGC----CGCTA-----G---"
def MD(c):
if c.isalpha():return "M"
else : return "D"
count=1
string=""
for i in range(len(seq)-1):
if MD(seq[i])==MD(seq[i+1]): count+=1
else:
string=string+str(count)+MD(seq[i])
count=1
string=string+str(count)+MD(seq[-1])
print string
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With