how to extend ambiguous dna sequence

Tags:

Let's say you have a DNA sequence like this :

AATCRVTAA

where R and V are ambiguous values of DNA nucleotides, where R represents either A or G and V represents A, C or G.

Is there a Biopython method to generate all the different combinations of sequences that could be represented by the above ambiguous sequence ?

Here for instance, the output would be :

AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA

585

asked Dec 18 '14 17:12

jrjc

2 Answers

Perhaps a little shorter and faster way, since by all odds this function is going to be used on very large data:

from Bio import Seq
from itertools import product

def extend_ambiguous_dna(seq):
   """return list of all possible sequences given an ambiguous DNA input"""
   d = Seq.IUPAC.IUPACData.ambiguous_dna_values
   return [ list(map("".join, product(*map(d.get, seq)))) ]

Using map allows your loops to be executed in C rather than in Python. This should prove much faster than using plain loops or even list comprehensions.

Field testing

With a simple dict as d instead of the one returned by ambiguous_na_values

from itertools import product
import time

d = { "N": ["A", "G", "T", "C"], "R": ["C", "A", "T", "G"] }
seq = "RNRN"

# using list comprehensions
lst_start = time.time()
[ "".join(i) for i in product(*[ d[j] for j in seq ]) ]
lst_end = time.time()

# using map
map_start = time.time()
[ list(map("".join, product(*map(d.get, seq)))) ]
map_end = time.time()

lst_delay = (lst_end - lst_start) * 1000
map_delay = (map_end - map_start) * 1000

print("List delay: {} ms".format(round(lst_delay, 2)))
print("Map delay: {} ms".format(round(map_delay, 2)))

Outputs:

# len(seq) = 2:
List delay: 0.02 ms
Map delay: 0.01 ms

# len(seq) = 3:
List delay: 0.04 ms
Map delay: 0.02 ms

# len(seq) = 4
List delay: 0.08 ms
Map delay: 0.06 ms

# len(seq) = 5
List delay: 0.43 ms
Map delay: 0.17 ms

# len(seq) = 10
List delay: 126.68 ms
Map delay: 77.15 ms

# len(seq) = 12
List delay: 1887.53 ms
Map delay: 1320.49 ms

Clearly map is better, but just by a factor of 2 or 3. It's certain it could be further optimised.

answered Oct 21 '22 05:10

Jivan

I eventually write my own function :

from Bio import Seq
from itertools import product

def extend_ambiguous_dna(seq):
   """return list of all possible sequences given an ambiguous DNA input"""
   d = Seq.IUPAC.IUPACData.ambiguous_dna_values
   r = []
   for i in product(*[d[j] for j in seq]):
      r.append("".join(i))
   return r 

In [1]: extend_ambiguous_dna("AV")
Out[1]: ['AA', 'AC', 'AG']

It allows you to generate every pattern for a given size with

In [2]: extend_ambiguous_dna("NN")

Out[2]: ['GG', 'GA', 'GT', 'GC',
         'AG', 'AA', 'AT', 'AC',
         'TG', 'TA', 'TT', 'TC',
         'CG', 'CA', 'CT', 'CC']

Hope this will save time to others !

answered Oct 21 '22 06:10

jrjc

Related questions
                            
                                Python Truncating to 32-bit
                            
                                Easy, scriptable way to sub-sample unstructured THREDDS data?
                            
                                Search and replace text odfpy
                            
                                ImportError: cannot import name 'certs'
                            
                                Pydev debugger: Unable to find module to reload
                            
                                Does Python optimize dictionary lookups under the hood?
                            
                                Python coverage.py exclude_lines
                            
                                HTTP METHOD categorization cancel vs. delete
                            
                                Python: Visualize a normal curve on data's histogram
                            
                                regex: getting backreference to number, adding to it
                            
                                Schematron validation with lxml in Python: how to retrieve validation errors?
                            
                                Modifying widget colour at runtime without overwriting stylesheet
                            
                                force pandas to read nan as string
                            
                                How does this Python import work?
                            
                                How to print over raw_input's line in Python?
                            
                                /usr/bin/python vs /opt/local/bin/python2.7 on OS X
                            
                                Find all strings that are in between two sub strings
                            
                                PyOpenGL ubuntu 14.04: undefined function error
                            
                                How to "pretty print" a python pandas DatetimeIndex
                            
                                Python: pip tries to install to /bin directory

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

how to extend ambiguous dna sequence

Tags:

python

dna-sequence

biopython

jrjc

People also ask

2 Answers

Field testing

Jivan

jrjc

Recent Activity

Donate For Us