I was curious to know if there is any bioinformatics tool out there able to process a multiFASTA file giving me infos like number of sequences, length, nucleotide/aminoacid content, etc. and maybe automatically draw descriptive plots. Also an R BIoconductor solution or a BioPerl module would do, but I didn't manage to find anything.
Can you help me? Thanks a lot :-)
Multi-fasta file: A text file file containing several DNA sequences in fasta format. Every fasta entry has 2 fundamental blocks. The first one is a single text line starting by '>' character following by a sequence description. The second block is the sequence and may contain several lines.
Download FASTA and GenBank flat fileYou can download sequence and other data from the graphical viewer by accessing the Download menu on the toolbar. You can download the FASTA formatted sequence of the visible range, all markers created on the sequence, or all selections made of the sequence.
Some of the emboss tools are a collection of small tools that can help you out.
seqstats
returns sequence lengthpepstats
should give you aminoacid content etc.
Some of the tools also offer plotting functions. Very handy.
http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/groups.html
To count number of fasta entries, I use:
grep -c '^>' mySequences.fasta
.
To make sure none of the entries are duplicate, I check that I get the same number when doing this: grep '^>' mySequences.fasta | sort | uniq | wc -l
You may also be interested in faSize, which is a tool from the Kent Source Tree, although this requires a bit more effort (you must dload and compile) than just using grep... here is some example output:
me@my-lab ~/data $ time faSize myfile.fna
215400419 bases (104761 N's 215295658 real 215295658 upper 0 lower) in 731620 sequences in 1 files
Total size: mean 294.4 sd 138.5 min 30 (F5854LK02GG895) max 1623 (F5854LK01AHBEH) median 307
N count: mean 0.1 sd 0.4
U count: mean 294.3 sd 138.5
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real
real 0m3.710s
user 0m3.541s
sys 0m0.164s
Screed in python is brilliant:
import screed
for record in screed.open(fastafilename):
print(len(record.sequence))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With