I'm writing a Python program that has to compute a numerical coding of the mutated residues and their positions in a set of strings. These strings are protein sequences, stored in a FASTA-format file with each sequence separated by a comma. The sequence lengths may differ between proteins. I tried to find the positions and residues that are mutated.
I used the following code for this:
a = 'AGFESPKLH'
b = 'KGFEHMKLH'
# Print the 0-based index and both residues wherever the two sequences differ
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])
But I want to use the sequence file as input. The following figure illustrates my project: the first box represents the alignment of the input file sequences, and the last box represents the output file. How can I do this in Python? Please help me. Thank you, everyone, for your time.
example:
input file
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
PROTEIN SEQUENCE ALIGNMENT:
positions          1 2 3 4 5 6
protein sequence1  M T A Q D D
protein sequence2  M T A Q D D
protein sequence3  M T S Q E D
protein sequence4  M T A Q D D
protein sequence5  M K A Q H D

DISCARD NON-VARIABLE REGION:
positions          2 3 5
protein sequence1  T A D
protein sequence2  T A D
protein sequence3  T S E
protein sequence4  T A D
protein sequence5  K A H

MUTATED RESIDUE IS SPLIT INTO SEPARATE COLUMNS:
positions          2 2 3 3 5 5 5
protein sequence1  T   A   D
protein sequence2  T   A   D
protein sequence3  T     S   E
protein sequence4  T   A   D
protein sequence5    K A       H
Output file should be like this:
position+residue  2T  2K  3A  3S  5D  5E  5H
sequence1         1   0   1   0   1   0   0
sequence2         1   0   1   0   1   0   0
sequence3         1   0   0   1   0   1   0
sequence4         1   0   1   0   1   0   0
sequence5         0   1   1   0   0   0   1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
If you are working with tabular data, consider pandas:
import pandas as pd
data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'
# One row per sequence, one column per alignment position (0-based)
df = pd.DataFrame([list(row) for row in data.split(',')])
# One 0/1 indicator column per (position, residue) pair
coded = pd.DataFrame({str(col) + val: (df[col] == val).astype(int)
                      for col in df.columns for val in sorted(set(df[col]))})
print(coded)
output:
   0M  1K  1T  2A  2S  3Q  4D  4E  4H  5D
0   1   0   1   1   0   1   1   0   0   1
1   1   0   1   1   0   1   1   0   0   1
2   1   0   1   0   1   1   0   1   0   1
3   1   0   1   1   0   1   1   0   0   1
4   1   1   0   1   0   1   0   0   1   1
If you want to drop the columns with all ones (the positions that never vary):
# coded.all() is True for columns containing only ones; keep the rest
print(coded.loc[:, ~coded.all()])
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1
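If you need the input to come from a file and want the 1-based "position+residue" labels from the question, here is a minimal sketch using only the standard library. The function names (`encode_variable_positions`, `read_sequences`, `write_table`) are made up for illustration. It assumes the file holds comma-separated sequences as in the example, and it right-pads shorter sequences with `-`, which is only a guess at how sequences of unequal length should line up.

```python
import csv
import io


def encode_variable_positions(sequences, pad="-"):
    """Return (labels, rows): a 0/1 coding of residues at variable positions.

    Labels use 1-based positions, e.g. '2T'. Shorter sequences are
    right-padded with `pad` (an assumption, not something the question states).
    """
    width = max(len(s) for s in sequences)
    padded = [s.ljust(width, pad) for s in sequences]
    columns = []
    for pos in range(width):
        # Residues seen at this position, in order of first appearance
        seen = list(dict.fromkeys(seq[pos] for seq in padded))
        if len(seen) > 1:  # keep only positions where the residue varies
            columns.extend((pos, res) for res in seen)
    rows = [[int(seq[pos] == res) for pos, res in columns] for seq in padded]
    labels = ["%d%s" % (pos + 1, res) for pos, res in columns]
    return labels, rows


def read_sequences(handle):
    """Read comma-separated sequences from an open text file."""
    return [s.strip() for s in handle.read().split(",") if s.strip()]


def write_table(labels, rows, handle):
    """Write the coded table as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["position+residue"] + labels)
    for n, row in enumerate(rows, start=1):
        writer.writerow(["sequence%d" % n] + row)


# Demo on an in-memory file holding the example data from the question
source = io.StringIO("MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD")
labels, rows = encode_variable_positions(read_sequences(source))
out = io.StringIO()
write_table(labels, rows, out)
print(out.getvalue())
```

With real files you would pass handles from `open(...)` instead of the `io.StringIO` objects, opening the output file with `newline=""` as the `csv` module recommends. A genuine FASTA file (with `>` header lines) would need a different reader, for example Biopython's `SeqIO`.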