
multiFASTA file processing

I was curious to know whether there is any bioinformatics tool that can process a multiFASTA file and report things like the number of sequences, their lengths, nucleotide/amino acid content, etc., and maybe automatically draw descriptive plots. An R Bioconductor solution or a BioPerl module would also do, but I didn't manage to find anything.

Can you help me? Thanks a lot :-)

Federico Giorgi asked Nov 24 '09 10:11




3 Answers

EMBOSS is a collection of small tools, and some of them can help you out.

  • seqstats returns sequence length
  • pepstats should give you amino acid content etc.

Some of the tools also offer plotting functions. Very handy. http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/groups.html

To count number of fasta entries, I use: grep -c '^>' mySequences.fasta.

To make sure none of the entries are duplicated, I check that I get the same number when doing this (it compares header lines only, not the sequences themselves): grep '^>' mySequences.fasta | sort | uniq | wc -l
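
The same two checks can be sketched in Python when you also want to know which headers are repeated. This is a minimal sketch, not part of the original answer: `header_summary` and the example lines are hypothetical names chosen for illustration, and, like the grep pipeline, it compares whole header lines rather than sequences.

```python
from collections import Counter


def header_summary(lines):
    """Return (total headers, distinct headers, sorted duplicated headers)."""
    # Like grep '^>', keep only lines that begin with '>' and compare them whole.
    counts = Counter(line.rstrip("\r\n") for line in lines if line.startswith(">"))
    duplicated = sorted(header for header, n in counts.items() if n > 1)
    return sum(counts.values()), len(counts), duplicated


# Hypothetical in-memory example; for a real file, pass an open file handle instead.
example = [">seq1\n", "ACGT\n", ">seq2\n", "GGCC\n", ">seq1\n", "TTAA\n"]
total, distinct, duplicated = header_summary(example)
print(total, distinct, duplicated)
```

If the total and the distinct count differ, the third value lists the offending headers.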

Yannick Wurm answered Sep 23 '22 21:09


You may also be interested in faSize, a tool from the Kent Source Tree, although it requires a bit more effort (you must download and compile it) than just using grep. Here is some example output:

me@my-lab ~/data $ time faSize myfile.fna
215400419 bases (104761 N's 215295658 real 215295658 upper 0 lower) in 731620 sequences in 1 files
Total size: mean 294.4 sd 138.5 min 30 (F5854LK02GG895) max 1623 (F5854LK01AHBEH) median 307
N count: mean 0.1 sd 0.4
U count: mean 294.3 sd 138.5
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real

real    0m3.710s
user    0m3.541s
sys     0m0.164s
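
If compiling faSize is not an option, a rough subset of these numbers can be computed with the Python standard library. This is only a sketch under stated assumptions: `parse_fasta` and `size_summary` are hypothetical helper names, the parser handles plain multi-line FASTA only, and it does not reproduce faSize's standard deviation or masking figures.

```python
import statistics


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def size_summary(lines):
    """Summarize sequence lengths and N counts for all records."""
    lengths = []
    n_total = 0
    for _, seq in parse_fasta(lines):
        lengths.append(len(seq))
        n_total += seq.upper().count("N")
    if not lengths:
        raise ValueError("no FASTA records found")
    return {
        "sequences": len(lengths),
        "bases": sum(lengths),
        "n_bases": n_total,
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# Hypothetical in-memory example; for a real file, pass an open file handle instead.
sample = [">a", "ACGT", "NN", ">b", "AC"]
print(size_summary(sample))
```

Because the file is read line by line and only the lengths are kept, memory use stays modest even for large files, though it will be slower than the compiled tool.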
brant.faircloth answered Sep 21 '22 21:09


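Screed in Python is brilliant:

import screed

# Path to the multi-FASTA file (placeholder; replace with your own).
fastafilename = "mySequences.fasta"

# Print the length of every sequence in the file.
for record in screed.open(fastafilename):
    print(len(record.sequence))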
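
The question also asks about nucleotide/amino acid content. A minimal sketch of how that could be tallied from sequence strings follows; `composition` and `fractions` are hypothetical helper names, and the example sequences are made up. It uses only the standard library, so it works on any iterable of strings.

```python
from collections import Counter


def composition(sequences):
    """Count residues (case-insensitively) across an iterable of sequence strings."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    return counts


def fractions(counts):
    """Convert raw counts into fractions of the total; empty input gives {}."""
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()} if total else {}


# Hypothetical sequences; with screed these would be the record.sequence values.
counts = composition(["ACGT", "ggcc", "AANN"])
print(counts)
print(fractions(counts))
```

With screed, the same function could presumably be fed directly, for example `composition(record.sequence for record in screed.open(fastafilename))`.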
mattmoore_bioinfo answered Sep 23 '22 21:09