I'm trying to understand the basic steps of the FASTA algorithm for searching a database for sequences similar to a query sequence. These are the steps of the algorithm:

1) Identify common k-words between I and J
2) Score diagonals with k-word matches, identify the 10 best diagonals
3) Rescore initial regions with a substitution score matrix
4) Join initial regions using gaps, penalise for gaps
5) Perform dynamic programming to find final alignments

I'm confused about the 3rd and 4th steps: how the PAM250 score matrix is used, and how to "join using gaps". Can somebody explain these two steps as specifically as possible? Thanks

FASTA Algorithm Explanation

1 Answer

This is how FASTA works:

1) Find all k-length identities, then find locally similar regions by selecting those dense with k-word identities (i.e. many k-words, without too many gaps between them). The best ten initial regions are used.
2) The initial regions are re-scored along their lengths by applying a substitution matrix in the usual way. Optimally scoring subregions are identified.
3) Create an alignment of the trimmed initial regions using dynamic programming, with a gap penalty of 20. Regions with too low a score are not included.
4) Optimize the alignment from 3) using "banded" dynamic programming (Smith-Waterman). This is dynamic programming restricted to a 32-residue-wide band around the original alignment, which saves space and time over full dynamic programming.
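
The first two steps above can be sketched in Python roughly as follows. This is a minimal illustration under simplifying assumptions, not the FASTA source: diagonals are ranked by a plain count of k-word hits (the real program also accounts for the spacing between hits), `toy_score` is a made-up stand-in for a PAM250 lookup, and all function names are hypothetical.

```python
from collections import defaultdict

def kword_hits(query, target, k=2):
    """Map each diagonal (target offset minus query offset) to the
    query positions where a shared k-word starts."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = defaultdict(list)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            hits[j - i].append(i)
    return hits

def toy_score(a, b):
    """Assumed placeholder values, NOT the real PAM250 matrix."""
    return 5 if a == b else -2

def rescore_diagonal(query, target, diag, score=toy_score):
    """Walk one diagonal without gaps, summing substitution scores, and
    return (score, query_start, query_end) of the best-scoring stretch
    (a maximum-subarray scan; query_end is exclusive)."""
    start = max(0, -diag)
    end = min(len(query), len(target) - diag)
    best = (0, start, start)
    run, run_start = 0, start
    for i in range(start, end):
        if run <= 0:                 # a non-positive prefix never helps
            run, run_start = 0, i
        run += score(query[i], target[i + diag])
        if run > best[0]:
            best = (run, run_start, i + 1)
    return best

query, target = "ACDEFGHIK", "WWACDEYGHIKWW"
hits = kword_hits(query, target)
best_diag = max(hits, key=lambda d: len(hits[d]))   # simplified ranking
print(best_diag, rescore_diagonal(query, target, best_diag))
# prints: 2 (38, 0, 9)
```

So "rescoring" here means walking along a diagonal without gaps, adding the matrix value for each aligned residue pair, and keeping the highest-scoring stretch. Note that the single F/Y mismatch breaks the run of k-word hits but costs only a small amount once the matrix is applied.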

If there are insufficient initial regions to form an alignment in 3), the best score from 2) can be used to rank sequences by similarity. Scores from 3) and 4) can also be used for that purpose.
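
To make the score from 3) more concrete, here is a rough sketch of joining regions. It assumes the gap penalty of 20 is charged as a flat cost per join, and that a region can follow another only if it starts at or after the other's end in both sequences. The tuple layout and the name `join_regions` are hypothetical, chosen for illustration.

```python
def join_regions(regions, join_penalty=20):
    """regions: list of (score, q_start, q_end, t_start, t_end), ends exclusive.
    Returns (best_total, chain), where chain is the list of joined regions."""
    regions = sorted(regions, key=lambda r: (r[1], r[3]))
    best = []  # best[i] = (best chain score ending at region i, previous index)
    for i, (score, qs, qe, ts, te) in enumerate(regions):
        total, prev = score, None
        for j in range(i):
            _, _, pqe, _, pte = regions[j]
            if pqe <= qs and pte <= ts:          # j lies before i in both sequences
                cand = best[j][0] + score - join_penalty
                if cand > total:
                    total, prev = cand, j
        best.append((total, prev))
    if not best:
        return 0, []
    i = max(range(len(best)), key=lambda n: best[n][0])
    total, chain = best[i][0], []
    while i is not None:                         # trace the chain backwards
        chain.append(regions[i])
        i = best[i][1]
    return total, chain[::-1]

regions = [
    (50, 0, 10, 0, 10),
    (40, 15, 25, 18, 28),
    (15, 30, 35, 30, 35),   # scores less than the join penalty
    (30, 5, 12, 40, 47),    # overlaps the first region in the query
]
print(join_regions(regions))
# prints: (70, [(50, 0, 10, 0, 10), (40, 15, 25, 18, 28)])
```

The third region is compatible with the chain but adds only 15 against a penalty of 20, so it is left out. That is the sense in which low-scoring regions are not included.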

Unfortunately, my institution doesn't have access to the original FASTA paper, so I can't supply the original values of the various parameters mentioned above.
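
With that caveat, the banding idea in 4) can still be sketched with placeholder parameters. The match, mismatch and gap values below are assumptions picked for the example, as is the simple linear gap cost; only the band restriction itself is the point. The function name is hypothetical.

```python
def banded_local_score(query, target, center_diag, band=32,
                       match=5, mismatch=-4, gap=-8):
    """Smith-Waterman local score, filling only cells whose diagonal
    (target index minus query index) is within band // 2 of center_diag."""
    half = band // 2
    prev = {}          # scores of the previous row, keyed by target column
    best = 0
    for i in range(1, len(query) + 1):
        cur = {}
        lo = max(1, i + center_diag - half)
        hi = min(len(target), i + center_diag + half)
        for j in range(lo, hi + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            h = max(0,
                    prev.get(j - 1, 0) + s,    # extend along the diagonal
                    prev.get(j, 0) + gap,      # residue of query left unaligned
                    cur.get(j - 1, 0) + gap)   # residue of target left unaligned
            cur[j] = h
            best = max(best, h)
        prev = cur
    return best

query, target = "ACDEFGHIK", "WWACDEGHIKWW"   # F is missing from the target
print(banded_local_score(query, target, center_diag=2, band=4))  # prints: 32
print(banded_local_score(query, target, center_diag=2, band=0))  # prints: 20
```

Cells outside the band are treated as score 0, which in a local alignment is the same as starting a fresh alignment there. With a band of 4 the alignment can shift from diagonal 2 to diagonal 1 across the deletion; with a band of 0 it is stuck on one diagonal and only the ungapped `ACDE` match is found. Each row touches at most band + 1 cells, which is where the space and time savings come from.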

191

answered Sep 18 '22 18:09

reve_etrange


Tags:

bioinformatics

fasta

conmadoi
