How to find inverted repeat patterns in a FASTA sequence?

Tags:

python

fasta

Suppose my long sequence looks like:

5’-AGGGTTTCCC**TGACCT**TCACTGC**AGGTCA**TGCA-3’

The two marked subsequences (shown here between the pairs of stars) together form what is called an inverted repeat. The length of the two subsequences and their composition of the four letters A, T, G, C will vary, but the two are always related. Take the first subsequence: its complement is ACTGGA (A pairs with T, and G pairs with C), and when you reverse this complement (i.e. the last letter comes first) it matches the second subsequence.
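The relation described above can be checked mechanically. This is a minimal sketch, assuming plain uppercase A/C/G/T input; the helper names `reverse_complement` and `is_inverted_repeat` are made up for illustration and do not come from any library:

```python
# Map each base to its pairing partner (assumes uppercase A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]

def is_inverted_repeat(first, second):
    """True if `second` is the reverse complement of `first`."""
    return reverse_complement(first) == second

print(reverse_complement("TGACCT"))            # AGGTCA
print(is_inverted_repeat("TGACCT", "AGGTCA"))  # True
```

Any ambiguity codes such as N would pass through `translate` unchanged, which may or may not be what is wanted for real data.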

There are a large number of such patterns in a FASTA sequence (containing 10 million A, T, G, C letters), and I want to find these patterns along with their start and end positions.

asked Jan 12 '13 21:01

user1964587

1 Answer

I'm new to both Python and bioinformatics, but I'm working my way through the rosalind.info web site to learn some of both. You can do this with a suffix tree. A suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree) is the magical data structure that makes all things possible in bioinformatics. It lets you quickly locate common substrings in multiple long sequences. A suffix tree can be built in time linear in the sequence length, so length 10,000,000 is feasible.

So first find the reverse complement of the sequence. Then put both into the suffix tree, and find the common substrings between them (of some minimum length).
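For intuition, the same two steps can be sketched without a suffix tree by comparing fixed-length substrings (k-mers) of the sequence and of its reverse complement. This is a simplified stand-in, not the suffix tree method: it only reports substrings of exactly length `k`, it does not report positions, and the names `kmers` and `shared_kmers` are hypothetical.

```python
# Assumes uppercase A, C, G, T only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmers(seq, k):
    """Length-k substrings present in both seq and its reverse complement."""
    return kmers(seq, k) & kmers(reverse_complement(seq), k)

data = "AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA"
print(sorted(shared_kmers(data, 6)))  # ['AGGTCA', 'CTGCAG', 'TGACCT']
```

A suffix tree generalises this idea to shared substrings of any length at or above the minimum, without fixing `k` in advance.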

The code below uses this suffix tree implementation: http://www.daimi.au.dk/~mailund/suffix_tree.html. It's written in C with Python bindings. It won't handle a large number of sequences, but two is no problem. However, I can't say whether it will handle length 10,000,000.

from suffix_tree import GeneralisedSuffixTree

# Pairing partner for each base
baseComplement = { 'A' : 'T', 'C' : 'G', 'G' : 'C', 'T' : 'A' }

def revc(seq):
    # Reverse the sequence, then complement each base
    return "".join([baseComplement[base] for base in seq[::-1]])

data = "AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA"
# revc  TGCATGACCTGCAGTGAAGGTCAGGGAAACCCT
#       012345678901234567890123456789012
#                 1         2         3
minlength = 6

n = len(data)
tree = GeneralisedSuffixTree([data, revc(data)])
for shared in tree.sharedSubstrings(minlength):
    # print(shared)
    _, start, stop = shared[0]
    seq = data[start:stop]
    _, rstart, rstop = shared[1]
    # Map positions in revc(data) back to positions in data
    rseq = data[n-rstop:n-rstart]
    print("Match: {0} at [{1}:{2}] and {3} at [{4}:{5}]".format(seq, start, stop, rseq, n-rstop, n-rstart))

This produces the output:

Match: AGGTCA at [23:29] and TGACCT at [10:16]
Match: TGACCT at [10:16] and AGGTCA at [23:29]
Match: CTGCAG at [19:25] and CTGCAG at [19:25]

It finds each match twice, once from each end. And there's a reverse palindrome in there, too: CTGCAG is its own reverse complement, so it matches itself.
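If each inverted repeat should be reported only once, the mirrored hits can be collapsed by treating the two intervals as an unordered pair. A minimal sketch, assuming the matches have already been collected as `(start, stop, rstart, rstop)` tuples; that layout and the name `unique_matches` are hypothetical, and the sample list is filled in by hand from the output above:

```python
def unique_matches(matches):
    """Collapse mirrored hits; each result is (left_interval, right_interval)."""
    seen = set()
    for start, stop, rstart, rstop in matches:
        # Order the two intervals so that (a, b) and (b, a) look identical.
        pair = tuple(sorted([(start, stop), (rstart, rstop)]))
        seen.add(pair)
    return sorted(seen)

# Hand-copied from the three "Match:" lines above (assumed tuple layout).
matches = [(23, 29, 10, 16), (10, 16, 23, 29), (19, 25, 19, 25)]
for left, right in unique_matches(matches):
    # A hit whose two intervals coincide is its own reverse complement.
    kind = "palindrome" if left == right else "inverted repeat"
    print(kind, left, right)
```

With the three matches above, this leaves one inverted repeat at (10, 16) and (23, 29), plus the palindrome at (19, 25).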

answered Nov 04 '22 02:11

Carl Raymond
