This is a question regarding a more efficient code design:
Assume three aligned DNA sequences (seq1, seq2 and seq3; they are each strings) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.
# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
I wish to remove the gaps (i.e., dashes) from the alignment and maintain the relative association of the start and stop positions of the genes.
# Desired output
align = {"seq1":"ATGCATGC",
"seq2":"ATGC",
"seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,1], "gene2":[2,3]},
"seq3":{"gene1":[0,1], "gene2":[2,3]}}
Obtaining the desired output is less trivial than it may seem. Below I wrote some (line-numbered) pseudocode for this problem, but surely there is a more elegant design.
1 measure length of any aligned gene # take any seq, since all seqs aligned
2 list_lengths = list of gene lengths # order is important
3 for seq in alignment
4 outseq = ""
5 for each num in range(0, length(seq)) # weird for-loop is intentional
6 if seq[num] == "-"
7 current_gene = gene whose start/stop positions include num
8 subtract 1 from length of current_gene
9 subtract 1 from lengths of all genes following current_gene in list_lengths
10 else
11 append seq[num] to outseq
12 append outseq to new variable
13 convert gene lengths into start/stop positions and append ordered to new variable
Can anyone give me suggestions/examples for a shorter, more direct code design?
If the DNA and protein alignments differ, the protein alignment will almost certainly be more accurate, so use proteins.
The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases, means that the 'signal-to-noise ratio' in protein sequence alignments is much better than in alignments of DNA.
Because both protein and RNA structure is more evolutionarily conserved than sequence, structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.
It is much more difficult and time-consuming to sequence proteins than DNA. Thanks to the genetic code, the protein sequence can be deduced from the DNA sequence (but not vice versa, because most amino acids are encoded by more than one codon, see earlier section).
This answer handles your updated annos
dictionary from the comment to cdlanes answer. That answer leaves the annos
dictionary with the incorrect index of [2,1] for seq2
gene2
. My proposed solution will remove the gene
entry from the dictionary if the sequence contains ALL gaps in that region. Also to note, if a gene contains only one letter in the final align
, then anno[geneX]
will have equal indices for start and stop --> See seq3
gene1
from your commented annos
.
align = {"seq1":"ATGCATGC",
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}
import re
for name,anno in annos.items():
# indices of gaps removed usinig re
removed = [(m.start(0)) for m in re.finditer(r'-', align[name])]
# removes gaps from align dictionary
align[name] = re.sub(r'-', '', align[name])
build_dna = ''
for gene,inds in anno.items():
start_ind = len(build_dna)+1
#generator to sum the num '-' removed from gene
num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1])
# build the de-gapped string
build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "")
end_ind = len(build_dna)
if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps
del annos[name][gene] #remove the gene entry
continue
#update the values in the annos dictionary
annos[name][gene][0] = start_ind-1
annos[name][gene][1] = end_ind-1
Results:
In [3]: annos
Out[3]: {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
Results from the 3 gene annos
above. Just replace the annos
variable:
In [5]: annos3
Out[5]: {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}
The following matches the output of example program for both test cases:
align = {"seq1":"ATGCATGC",
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
(START, STOP) = (0, 1)
for alignment, sequence in align.items():
new_sequence = ''
gap = 0
for position, codon in enumerate(sequence):
if '-' == codon:
for gene in annos[alignment].values():
if gene[START] > (position - gap):
gene[START] -= 1
if gene[STOP] >= (position - gap):
gene[STOP] -= 1
gap += 1
else:
new_sequence += codon
align[alignment] = new_sequence
The result of running this:
python3 -i test.py
>>> align
{'seq2': 'ATGC', 'seq1': 'ATGCATGC', 'seq3': 'ACAC'}
>>>
>>> annos
{'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
>>>
I hope this is still elegant, direct, short and Pythonic enough!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With