fasta file: replace header with filename

Tags: bash, sed, fasta

I want to replace every header line (those starting with >) with >{filename} in all *.fasta files in my directory, and then concatenate the results.

Contents of my directory:

speciesA.fasta
speciesB.fasta
speciesC.fasta

Example file, speciesA.fasta:

>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL

My desired output (shown only for speciesA.fasta):

>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL

This is my code:

for file in *.fasta; do var=$(basename $file .fasta) | sed 's/>.*/>$var/' $var.fasta >>$var.outfile.fasta; done

but all I get is

>$var
MJSUNDKFJSKFJSKFJ
>$var
KEFJKSDJFKSDJFKSJFLSJDFLKSJF

[and so on ...]

Where did I make a mistake?

rororo asked Jun 01 '17 05:06


2 Answers

The bash loop is superfluous. Try:

awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta

This approach is safe even if the file names contain special or regex-active characters.

How it works

  • /^>/ {print ">" substr(FILENAME, 1, length(FILENAME)-6); next}

    For any line that begins with >, the commands in curly braces are executed. The first command prints > followed by the filename minus its last 6 characters (the .fasta suffix). The second command, next, skips the remaining awk commands for the current line and starts over with the next input line.

  • 1

    This is awk's cryptic shorthand for print-the-line.
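
The two rules above can also be written so that the suffix is stripped by name rather than by counting characters, with the result saved to a file. This is only a sketch under assumptions: the scratch directory, the sample file contents, the awk variable name, and the output name combined.fa are all made up for illustration (the output deliberately does not end in .fasta, so a rerun of the glob will not pick it up).

```shell
# Work in a throwaway directory with two small, hypothetical sample files.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '>protein1 description\nMJSUNDKFJSKFJSKFJ\n>protein2 anothername\nKEFJKSDJFKSDJFKSJFLSJDFLKSJF\n' > speciesA.fasta
printf '>protein1 description\nKSDAFJLASDJFKLAJFL\n' > speciesB.fasta

# FNR == 1 is true on the first line of each input file, so the label is
# computed once per file; sub() strips a trailing ".fasta" from the copy.
awk '
  FNR == 1 { name = FILENAME; sub(/\.fasta$/, "", name) }
  /^>/     { print ">" name; next }   # header line: print the label instead
  1                                   # any other line: print it unchanged
' *.fasta > combined.fa

cat combined.fa
```

Because the file name is only ever printed, never used as a pattern, this variant stays as indifferent to odd characters in file names as the one-liner.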

Example

Let's consider a directory with two (identical) test files:

$ cat speciesA.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL
$ cat speciesB.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL

The output of our command is:

$ awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL
>speciesB
MJSUNDKFJSKFJSKFJ
>speciesB
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesB
KSDAFJLASDJFKLAJFL

The output has the substitutions and concatenates all the input files.

John1024 answered Nov 14 '22 05:11


The sed script has to be in double quotes for the shell to expand variables. Inside single quotes, $var reaches sed as literal text.

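for file in *.fasta; do
    # ^ anchors the match to header lines; the replacement keeps the leading >
    # -i rewrites each file in place (GNU sed)
    sed -i "s/^>.*/>${file%.fasta}/" "$file"
done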
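
The quoting difference can be seen without touching any files. A small sketch, where the variable names and the sample header are purely illustrative:

```shell
var=speciesA
header='>protein1 description'

# Single quotes: the shell leaves $var alone, so sed inserts it literally.
single=$(printf '%s\n' "$header" | sed 's/^>.*/>$var/')

# Double quotes: the shell expands $var before sed ever sees the script.
double=$(printf '%s\n' "$header" | sed "s/^>.*/>$var/")

printf '%s\n%s\n' "$single" "$double"
# prints:
# >$var
# >speciesA
```

One caveat with the double-quoted form: the expanded value becomes part of the sed script, so a name containing / or & would change what the s command does.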
P.... answered Nov 14 '22 04:11