Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sequence length of FASTA file

Tags:

bash

awk

fasta

I have the following FASTA file:

>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT

My desired output:

>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.

This is my code:

awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

>header1
60
57
>header2
3
>header3
7

I need a small modification in order to deal with multiple sequence lines.

I also need a way to have the total sequences and total length. Any suggestion will be welcome... In bash or awk, please. I know that is easy to do it in Perl/BioPerl and actually, I have a script to do it in those ways.

like image 806
cucurbit Avatar asked Jun 02 '14 10:06

cucurbit


Video Answer


2 Answers

A quick way with any awk, would be this:

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta

You might be also interested in BioAwk, it is an adapted version of awk which is tuned to process FASTA files

bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.

like image 162
kvantour Avatar answered Oct 12 '22 03:10

kvantour


An awk / gawk solution can be composed by three stages:

  1. Every time header is found these actions should be performed:

    • Print previous seqlen if exists.
    • Print tag.
    • Initialize seqlen.
  2. For the sequence lines we just need to accumulate totals.
  3. Finally at the END stage we print the remnant seqlen.

Commented code:

awk '/^>/ { # header pattern detected
        if (seqlen){
         # print previous seqlen if exists 
         print seqlen
         }

         # pring the tag 
         print

         # initialize sequence
         seqlen = 0

         # skip further processing
         next
      }

# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa

A oneliner:

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa

For the totals:

awk '/^>/ { if (seqlen) {
              print seqlen
              }
            print

            seqtotal+=seqlen
            seqlen=0
            seq+=1
            next
            }
    {
    seqlen += length($0)
    }     
    END{print seqlen
        print seq" sequences, total length " seqtotal+seqlen
    }' file.fa
like image 26
Juan Diego Godoy Robles Avatar answered Oct 12 '22 01:10

Juan Diego Godoy Robles