Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find inverted repeated pattern in a FASTA sequence?

Tags:

python

fasta

Suppose my long sequence looks like:

5’-AGGGTTTCCC**TGACCT**TCACTGC**AGGTCA**TGCA-3

The two italics subsequences (here within the two stars) in this long sequence are together called as inverted repeat pattern. The length and the combination of the four letters such as A,T,G,C in those two subsequences will be varying. But there is a relation between these two subsequence. Notice that, when you consider the first subsequence then its complementary subsequence is ACTGGA (according to A combines with T and G combine with C) and when you invert this complementary subsequence (i.e. last letter comes first) then it matches with the second subsequence.

There are large number of such patterns present in a FASTA sequence (contains 10 million ATGC letters) and I want to find such patterns and their start and end positions.

like image 750
user1964587 Avatar asked Jan 12 '13 21:01

user1964587


People also ask

What is an inverted repeat sequence?

An inverted repeat (or IR) is a single stranded sequence of nucleotides followed downstream by its reverse complement. [1] The intervening sequence of nucleotides between the initial sequence and the reverse complement can be any length including zero. For example, 5'---TTACGnnnnnnCGTAA---3' is an inverted repeat sequence.

What is the FASTA format for nucleotide sequencing?

FASTA Format for Nucleotide Sequences In FASTA format the line before the nucleotide sequence, called the FASTA definition line, must begin with a carat (">"), followed by a unique SeqID (sequence identifier). The SeqID must be unique for each nucleotide sequence and should not contain any spaces. Please limit the SeqID to 25 characters or less.

How long should the line after the FASTA definition line be?

The line after the FASTA definition line begins the nucleotide sequence. Unlike the FASTA definition line, the nucleotide sequence itself can contain returns. It is recommended that each line of sequence be no longer than 80 characters. Please only use IUPAC symbols within the nucleotide sequence.

What is the difference between DNA and FASTA?

›DNA.new [organism=Homo sapiens] [chromosome=17] [map=17q21] [moltype=mRNA] Homo sapiens breast and ovarian cancer susceptibility protein (BRCA1) mRNA, complete cds. The line after the FASTA definition line begins the nucleotide sequence. Unlike the FASTA definition line, the nucleotide sequence itself can contain returns.


1 Answers

I'm new to both Python and bioinformatics, but I'm working my way through the rosalind.info web site to learn some of both. You do this with a suffix tree. A suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree) is the magical data structure that makes all things possible in bioinformatics. You quickly locate common substrings in multiple long sequences. Suffix trees only require linear time, so length 10,000,000 is feasible.

So first find the reverse complement of the sequence. Then put both into the suffix tree, and find the common substrings between them (of some minimum length).

The code below uses this suffix tree implementation: http://www.daimi.au.dk/~mailund/suffix_tree.html. It's written in C with Python bindings. It won't handle a large number of sequences, but two is no problem. However I can't say whether this will handle length 10,000,000.

from suffix_tree import GeneralisedSuffixTree

baseComplement = { 'A' : 'T', 'C' : 'G', 'G' : 'C', 'T' : 'A' }

def revc(seq):
    return "".join([baseComplement[base] for base in seq[::-1]])

data = "AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA"
# revc  TGCATGACCTGCAGTGAAGGTCAGGGAAACCCT
#       012345678901234567890123456789012
#                 1         2         3
minlength = 6

n = len(data)
tree = GeneralisedSuffixTree([data, revc(data)])
for shared in tree.sharedSubstrings(minlength):
    #print shared
    _, start, stop = shared[0]
    seq = data[start:stop]
    _, rstart, rstop = shared[1]
    rseq = data[n-rstop:n-rstart]
    print "Match: {0} at [{1}:{2}] and {3} at [{4}:{5}]".format(seq, start, stop, rseq, n-rstop, n-rstart)

This produces output

Match: AGGTCA at [23:29] and TGACCT at [10:16]
Match: TGACCT at [10:16] and AGGTCA at [23:29]
Match: CTGCAG at [19:25] and CTGCAG at [19:25]

It finds each match twice, once from each end. And there's a reverse palindrome in there, too!

like image 58
Carl Raymond Avatar answered Nov 04 '22 02:11

Carl Raymond