Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

FASTA Algorithm Explanation

I'm trying to understand the basic steps of FASTA algorithm in searching similar sequences of a query sequence in a database. These are the steps of the algorithm:

  1. Identify common k-words between I and J
  2. Score diagonals with k-word matches, identify 10 best diagonals
  3. Rescore initial regions with a substitution score matrix
  4. Join initial regions using gaps, penalise for gaps
  5. Perform dynamic programming to find final alignments

I'm confused with the 3rd and 4th steps in using PAM250 score matrix, and how to "join using gaps".

Can somebody explain these two steps for me "as specifically as possible". Thanks

like image 874
conmadoi Avatar asked Dec 03 '11 08:12

conmadoi


People also ask

How does Fasta algorithm work?

FASTA takes a given nucleotide or amino acid sequence and searches a corresponding sequence database by using local sequence alignment to find matches of similar database sequences. The FASTA program follows a largely heuristic method which contributes to the high speed of its execution.

What is bioinformatics Fasta algorithm?

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.

What is FASTA sequence explain with example?

What is FASTA format? FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.

What is FASTA and BLAST algorithm?

BLAST and FASTA are two similarity searching programs that identify homologous DNA sequences and proteins based on the excess sequence similarity. The excess similarity between two DNA or amino acid sequences arises due to the common ancestry-homology.


1 Answers

This is how FASTA works:

  1. Find all k-length identities, then find locally similar regions by selecting those dense with k-word identities (i.e. many k-words, without too many gaps between). The best ten initial regions are used.
  2. The initial regions are re-scored along their lengths by applying a substitution matrix in the usual way. Optimally scoring subregions are identified.
  3. Create an alignment of the trimmed initial regions using dynamic programming, with a gap penalty of 20. Regions with too low of a score are not included.
  4. Optimize the alignment from 3) using "banded" dynamic programming (Smith-Waterman). This is dynamic programming restricted to the 32 residue-wide band around the original alignment, which saves space and time over full dynamic programming.

If there are insufficient initial regions to form an alignment in 3), the best score from 2) can be used to rank sequences by similarity. Scores from 3) and 4) can also be used for that purpose.

Unfortunately my institution doesn't have access to the original FASTA paper so I can't supply the original values of the various parameters mentioned above.

like image 191
reve_etrange Avatar answered Sep 18 '22 18:09

reve_etrange