Suppose my long sequence looks like:
5’-AGGGTTTCCC**TGACCT**TCACTGC**AGGTCA**TGCA-3
The two italics subsequences (here within the two stars) in this long sequence are together called as inverted repeat pattern. The length and the combination of the four letters such as A,T,G,C in those two subsequences will be varying. But there is a relation between these two subsequence. Notice that, when you consider the first subsequence then its complementary subsequence is ACTGGA (according to A combines with T and G combine with C) and when you invert this complementary subsequence (i.e. last letter comes first) then it matches with the second subsequence.
There are large number of such patterns present in a FASTA sequence (contains 10 million ATGC letters) and I want to find such patterns and their start and end positions.
An inverted repeat (or IR) is a single stranded sequence of nucleotides followed downstream by its reverse complement. [1] The intervening sequence of nucleotides between the initial sequence and the reverse complement can be any length including zero. For example, 5'---TTACGnnnnnnCGTAA---3' is an inverted repeat sequence.
FASTA Format for Nucleotide Sequences In FASTA format the line before the nucleotide sequence, called the FASTA definition line, must begin with a carat (">"), followed by a unique SeqID (sequence identifier). The SeqID must be unique for each nucleotide sequence and should not contain any spaces. Please limit the SeqID to 25 characters or less.
The line after the FASTA definition line begins the nucleotide sequence. Unlike the FASTA definition line, the nucleotide sequence itself can contain returns. It is recommended that each line of sequence be no longer than 80 characters. Please only use IUPAC symbols within the nucleotide sequence.
›DNA.new [organism=Homo sapiens] [chromosome=17] [map=17q21] [moltype=mRNA] Homo sapiens breast and ovarian cancer susceptibility protein (BRCA1) mRNA, complete cds. The line after the FASTA definition line begins the nucleotide sequence. Unlike the FASTA definition line, the nucleotide sequence itself can contain returns.
I'm new to both Python and bioinformatics, but I'm working my way through the rosalind.info web site to learn some of both. You do this with a suffix tree. A suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree) is the magical data structure that makes all things possible in bioinformatics. You quickly locate common substrings in multiple long sequences. Suffix trees only require linear time, so length 10,000,000 is feasible.
So first find the reverse complement of the sequence. Then put both into the suffix tree, and find the common substrings between them (of some minimum length).
The code below uses this suffix tree implementation: http://www.daimi.au.dk/~mailund/suffix_tree.html. It's written in C with Python bindings. It won't handle a large number of sequences, but two is no problem. However I can't say whether this will handle length 10,000,000.
from suffix_tree import GeneralisedSuffixTree
baseComplement = { 'A' : 'T', 'C' : 'G', 'G' : 'C', 'T' : 'A' }
def revc(seq):
return "".join([baseComplement[base] for base in seq[::-1]])
data = "AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA"
# revc TGCATGACCTGCAGTGAAGGTCAGGGAAACCCT
# 012345678901234567890123456789012
# 1 2 3
minlength = 6
n = len(data)
tree = GeneralisedSuffixTree([data, revc(data)])
for shared in tree.sharedSubstrings(minlength):
#print shared
_, start, stop = shared[0]
seq = data[start:stop]
_, rstart, rstop = shared[1]
rseq = data[n-rstop:n-rstart]
print "Match: {0} at [{1}:{2}] and {3} at [{4}:{5}]".format(seq, start, stop, rseq, n-rstop, n-rstart)
This produces output
Match: AGGTCA at [23:29] and TGACCT at [10:16]
Match: TGACCT at [10:16] and AGGTCA at [23:29]
Match: CTGCAG at [19:25] and CTGCAG at [19:25]
It finds each match twice, once from each end. And there's a reverse palindrome in there, too!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With