
Printing a sequence from a fasta file

Tags: grep, bash, fasta

I often need to find a particular sequence in a fasta file and print it. For those who don't know, fasta is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple: each record starts with a line containing the sequence name preceded by a '>', and all the lines that follow, up to the next '>', are the sequence itself. For example:

>sequence1
ACTGACTGACTGACTG
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>sequence3
ACTGACTGACTGACTG

The way I'm currently getting the sequence I need is to use grep with -A, so I'll do

grep -A 10 sequence_name filename.fa

and then if I don't see the start of the next sequence in the file, I'll change the 10 to 20 and repeat until I'm sure I'm getting the whole sequence.

It seems like there should be a better way to do this. For example, can I ask it to print up until the next '>' character?

Colin asked Dec 19 '22 10:12


1 Answer

Use > as the record separator, so that each sequence becomes one awk record whose first field ($1) is the sequence name:

awk -v seq="sequence2" -v RS='>' '$1 == seq {printf "%s%s", RS, $0}' file
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
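
If you do this often, it may be worth wrapping the lookup in a small shell function. The following is only a sketch: fasta_get is a made-up name, and it assumes the sequence name is the first word directly after the '>' on the header line. Unlike the record-separator version, it reads the file line by line and only treats a '>' at the start of a line as a header, so a '>' elsewhere in a description line should not split the record.

```shell
# fasta_get NAME FILE
# Print the record whose header's first word is exactly NAME.
# (fasta_get is a hypothetical helper name, not a standard tool.)
fasta_get() {
    awk -v seq="$1" '
        # Header line: strip the leading ">" from the first field and
        # remember whether this record is the one we want.
        /^>/ { keep = (substr($1, 2) == seq) }
        # While keep is true, print the header and its sequence lines.
        keep
    ' "$2"
}
```

Usage would be along the lines of: fasta_get sequence2 filename.fa. Because the name is compared for equality rather than matched as a pattern, asking for "sequence" will not also pull out "sequence1", "sequence2" and so on, which grep would do.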
glenn jackman answered Dec 22 '22 00:12