Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to extend ambiguous dna sequence

Let's say you have a DNA sequence like this :

AATCRVTAA

where R and V are ambiguous values of DNA nucleotides, where R represents either A or G and V represents A, C or G.

Is there a Biopython method to generate all the different combinations of sequences that could be represented by the above ambiguous sequence ?

Here for instance, the output would be :

AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA
like image 585
jrjc Avatar asked Dec 18 '14 17:12

jrjc


People also ask

What is ambiguous DNA?

Ambiguous nucleotides are used when the true nucleotide is unknown. The most popular one seen with high-throughput genome sequences is "N" which represents any possible nucleotide.

What do NS mean in DNA sequence?

Correction of sequence-dependent ambiguous bases (Ns) from the 454 pyrosequencing system - PMC. An official website of the United States government. Here's how you know. The . gov means it's official.


2 Answers

Perhaps a little shorter and faster way, since by all odds this function is going to be used on very large data:

from Bio import Seq
from itertools import product

def extend_ambiguous_dna(seq):
   """return list of all possible sequences given an ambiguous DNA input"""
   d = Seq.IUPAC.IUPACData.ambiguous_dna_values
   return [ list(map("".join, product(*map(d.get, seq)))) ]

Using map allows your loops to be executed in C rather than in Python. This should prove much faster than using plain loops or even list comprehensions.

Field testing

With a simple dict as d instead of the one returned by ambiguous_na_values

from itertools import product
import time

d = { "N": ["A", "G", "T", "C"], "R": ["C", "A", "T", "G"] }
seq = "RNRN"

# using list comprehensions
lst_start = time.time()
[ "".join(i) for i in product(*[ d[j] for j in seq ]) ]
lst_end = time.time()

# using map
map_start = time.time()
[ list(map("".join, product(*map(d.get, seq)))) ]
map_end = time.time()

lst_delay = (lst_end - lst_start) * 1000
map_delay = (map_end - map_start) * 1000

print("List delay: {} ms".format(round(lst_delay, 2)))
print("Map delay: {} ms".format(round(map_delay, 2)))

Outputs:

# len(seq) = 2:
List delay: 0.02 ms
Map delay: 0.01 ms

# len(seq) = 3:
List delay: 0.04 ms
Map delay: 0.02 ms

# len(seq) = 4
List delay: 0.08 ms
Map delay: 0.06 ms

# len(seq) = 5
List delay: 0.43 ms
Map delay: 0.17 ms

# len(seq) = 10
List delay: 126.68 ms
Map delay: 77.15 ms

# len(seq) = 12
List delay: 1887.53 ms
Map delay: 1320.49 ms

Clearly map is better, but just by a factor of 2 or 3. It's certain it could be further optimised.

like image 93
Jivan Avatar answered Oct 21 '22 05:10

Jivan


I eventually write my own function :

from Bio import Seq
from itertools import product

def extend_ambiguous_dna(seq):
   """return list of all possible sequences given an ambiguous DNA input"""
   d = Seq.IUPAC.IUPACData.ambiguous_dna_values
   r = []
   for i in product(*[d[j] for j in seq]):
      r.append("".join(i))
   return r 

In [1]: extend_ambiguous_dna("AV")
Out[1]: ['AA', 'AC', 'AG']

It allows you to generate every pattern for a given size with

In [2]: extend_ambiguous_dna("NN")

Out[2]: ['GG', 'GA', 'GT', 'GC',
         'AG', 'AA', 'AT', 'AC',
         'TG', 'TA', 'TT', 'TC',
         'CG', 'CA', 'CT', 'CC']

Hope this will save time to others !

like image 39
jrjc Avatar answered Oct 21 '22 06:10

jrjc