Improving code design of DNA alignment degapping

This is a question regarding a more efficient code design:

Assume three aligned DNA sequences (seq1, seq2 and seq3; they are each strings) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.

# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}

I wish to remove the gaps (i.e., dashes) from the alignment and maintain the relative association of the start and stop positions of the genes.

# Desired output
align = {"seq1":"ATGCATGC", 
         "seq2":"ATGC",
         "seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,1], "gene2":[2,3]},
         "seq3":{"gene1":[0,1], "gene2":[2,3]}}

Obtaining the desired output is less trivial than it may seem. Below I wrote some (line-numbered) pseudocode for this problem, but surely there is a more elegant design.

1  measure length of any aligned gene  # take any seq, since all seqs aligned
2  list_lengths = list of gene lengths  # order is important
3  for seq in alignment
4      outseq = ""
5      for each num in range(0, length(seq))  # weird for-loop is intentional
6          if seq[num] == "-"
7              current_gene = gene whose start/stop positions include num
8              subtract 1 from length of current_gene
9              subtract 1 from lengths of all genes following current_gene in list_lengths
10         else
11             append seq[num] to outseq
12     append outseq to new variable
13     convert gene lengths into start/stop positions and append ordered to new variable

Can anyone give me suggestions/examples for a shorter, more direct code design?

Which would you expect would provide a better alignment the nucleotide MSA or the protein MSA Why?

If the DNA and protein alignments differ, the protein alignment will almost certainly be more accurate, so use proteins.

Which sequence alignment is better protein or nucleotide?

The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases, means that the 'signal-to-noise ratio' in protein sequence alignments is much better than in alignments of DNA.

Why is protein sequence alignment produce more reliable result than DNA sequence alignment?

Because both protein and RNA structure is more evolutionarily conserved than sequence, structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.

Are proteins easier to sequence than DNA?

It is much more difficult and time-consuming to sequence proteins than DNA. Thanks to the genetic code, the protein sequence can be deduced from the DNA sequence (but not vice versa, because most amino acids are encoded by more than one codon, see earlier section).

This answer handles your updated annos dictionary from the comment to cdlanes answer. That answer leaves the annos dictionary with the incorrect index of [2,1] for seq2 gene2. My proposed solution will remove the gene entry from the dictionary if the sequence contains ALL gaps in that region. Also to note, if a gene contains only one letter in the final align, then anno[geneX] will have equal indices for start and stop --> See seq3 gene1 from your commented annos.

align = {"seq1":"ATGCATGC",
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}


annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
          "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
          "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}

import re
for name,anno in annos.items():
    # indices of gaps removed usinig re
    removed = [(m.start(0)) for m in re.finditer(r'-', align[name])]

    # removes gaps from align dictionary
    align[name] = re.sub(r'-', '', align[name])

    build_dna = ''
    for gene,inds in anno.items():

        start_ind = len(build_dna)+1

        #generator to sum the num '-' removed from gene
        num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1])

        # build the de-gapped string
        build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "")

        end_ind = len(build_dna)

        if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps
            del annos[name][gene] #remove the gene entry
            continue
        #update the values in the annos dictionary
        annos[name][gene][0] = start_ind-1
        annos[name][gene][1] = end_ind-1

Results:

In [3]: annos
Out[3]:  {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
          'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
          'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}

Results from the 3 gene annos above. Just replace the annos variable:

In [5]: annos3
Out[5]:  {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
          'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
          'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}

The following matches the output of example program for both test cases:

align = {"seq1":"ATGCATGC",
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}

(START, STOP) = (0, 1)

for alignment, sequence in align.items():
    new_sequence = ''
    gap = 0

    for position, codon in enumerate(sequence):
        if '-' == codon:
            for gene in annos[alignment].values():
                if gene[START] > (position - gap):
                    gene[START] -= 1
                if gene[STOP] >= (position - gap):
                    gene[STOP] -= 1
            gap += 1
        else:
            new_sequence += codon

    align[alignment] = new_sequence

The result of running this:

python3 -i test.py
>>> align
{'seq2': 'ATGC', 'seq1': 'ATGCATGC', 'seq3': 'ACAC'}
>>> 
>>> annos
{'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
>>>

I hope this is still elegant, direct, short and Pythonic enough!

Improving code design of DNA alignment degapping

Tags:

python

algorithm

optimization

dna-sequence

Michael G

People also ask

2 Answers

Kevin

cdlane

Recent Activity

Donate For Us

Improving code design of DNA alignment degapping

Tags:

python

algorithm

optimization

dna-sequence

Michael G

People also ask

2 Answers

Kevin

cdlane

Related questions

Recent Activity

Donate For Us