multiFASTA file processing

I was curious to know whether there is a bioinformatics tool that can process a multiFASTA file and report information such as the number of sequences, their lengths, nucleotide/amino acid content, etc., and maybe automatically draw descriptive plots. An R Bioconductor solution or a BioPerl module would also do, but I didn't manage to find anything.

Can you help me? Thanks a lot :-)

asked Nov 24 '09 10:11

Federico Giorgi

3 Answers

EMBOSS is a collection of small tools, and some of them can help you out:

seqstats returns sequence length
pepstats should give you amino acid content etc.

Some of the tools also offer plotting functions. Very handy. http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/groups.html

To count the number of FASTA entries, I use: grep -c '^>' mySequences.fasta

To make sure none of the entries are duplicates, I check that I get the same number when doing this: grep '^>' mySequences.fasta | sort | uniq | wc -l
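
If you would rather do the same check from Python, here is a minimal sketch of the two grep commands above. The function name header_counts and the in-memory example are made up for illustration; like the grep pipeline, it compares whole header lines, so two entries with the same ID but different descriptions are not flagged.

```python
from collections import Counter

def header_counts(lines):
    """Return (total headers, unique headers, list of duplicated headers)."""
    # A FASTA header is any line starting with '>', as in grep '^>'.
    headers = [line.rstrip("\n") for line in lines if line.startswith(">")]
    tally = Counter(headers)
    duplicated = [header for header, n in tally.items() if n > 1]
    return len(headers), len(tally), duplicated

# Hypothetical in-memory example; for a real file, pass an open file handle,
# e.g. header_counts(open("mySequences.fasta")).
example = [">seq1\n", "ACGT\n", ">seq2\n", "GGCC\n", ">seq1\n", "TTAA\n"]
total, unique, duplicated = header_counts(example)
print(total, unique, duplicated)
```

If total and unique differ, the duplicated list tells you which headers to look at.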

answered Sep 23 '22 21:09

Yannick Wurm

You may also be interested in faSize, a tool from the Kent Source Tree. It requires a bit more effort than just using grep (you must download and compile it). Here is some example output:

me@my-lab ~/data $ time faSize myfile.fna
215400419 bases (104761 N's 215295658 real 215295658 upper 0 lower) in 731620 sequences in 1 files
Total size: mean 294.4 sd 138.5 min 30 (F5854LK02GG895) max 1623 (F5854LK01AHBEH) median 307
N count: mean 0.1 sd 0.4
U count: mean 294.3 sd 138.5
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real

real    0m3.710s
user    0m3.541s
sys     0m0.164s
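
If compiling faSize is not an option, a rough approximation of that summary can be sketched in plain Python. This is not a reimplementation of faSize: the helper names read_fasta and size_summary are invented here, the standard deviation is the population one (I have not checked which definition faSize uses), and it assumes the file contains at least one sequence.

```python
import statistics

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def size_summary(records):
    """Summarise lengths and N content; assumes at least one record."""
    # Keep only (name, length, N count) so sequences are not held in memory.
    stats = [(name, len(seq), seq.upper().count("N")) for name, seq in records]
    lengths = [length for _, length, _ in stats]
    shortest = min(stats, key=lambda s: s[1])
    longest = max(stats, key=lambda s: s[1])
    return {
        "sequences": len(stats),
        "bases": sum(lengths),
        "n_bases": sum(n for _, _, n in stats),
        "mean": statistics.mean(lengths),
        "sd": statistics.pstdev(lengths),  # population sd (an assumption)
        "median": statistics.median(lengths),
        "min": (shortest[1], shortest[0]),
        "max": (longest[1], longest[0]),
    }

# Hypothetical in-memory example; for a real file, pass an open file handle.
example = [">a\n", "ACGTN\n", "ACG\n", ">b\n", "acgt\n", ">c\n", "NNACGTAC\n"]
summary = size_summary(read_fasta(example))
for key, value in summary.items():
    print(key, value)
```

Expect this to be much slower than faSize on large files; it is only meant for quick checks.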

answered Sep 21 '22 21:09

brant.faircloth

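Screed in Python is brilliant:

import screed

# Path to the FASTA file to read (placeholder name; adjust to your file)
fastafilename = "mySequences.fasta"

# Print the length of every sequence in the file
for record in screed.open(fastafilename):
    print(len(record.sequence))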
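
The same record-by-record loop can also give the nucleotide/amino acid content asked about in the question. Below is a minimal sketch using collections.Counter. To keep it self-contained it uses a small hand-written parser (read_fasta, a made-up helper) instead of screed; with screed you would pass record.sequence to the equally made-up composition function.

```python
from collections import Counter

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def composition(sequence):
    """Return each residue's fraction of the sequence, ignoring case."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        return {}  # empty sequence: nothing to report
    return {residue: n / total for residue, n in sorted(counts.items())}

# Hypothetical in-memory example; for a real file, pass an open file handle.
example = [">seq1\n", "AACG\n", ">seq2\n", "ggcc\n", "TT\n"]
for name, seq in read_fasta(example):
    print(name, len(seq), composition(seq))
```

It works unchanged for protein sequences, since it just counts whatever characters are present.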

answered Sep 23 '22 21:09

mattmoore_bioinfo

Tags:

bioinformatics

bioconductor

biopython

fasta

bioperl