I want to replace all the headers (lines starting with >) with >{filename} in all *.fasta files inside my directory, AND concatenate them afterwards.
content of my directory
speciesA.fasta
speciesB.fasta
speciesC.fasta
example of file, speciesA.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL
my desired output (only for speciesA.fasta for now):
>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL
This is my code:
for file in *.fasta; do var=$(basename $file .fasta) | sed 's/>.*/>$var/' $var.fasta >>$var.outfile.fasta; done
but all I get is
>$var
MJSUNDKFJSKFJSKFJ
>$var
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
[and so on ...]
Where did I make a mistake?
The bash loop is superfluous. Try:
awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
This approach is safe even if the file names contain special or regex-active characters.
/^>/ {print ">" substr(FILENAME, 1, length(FILENAME)-6); next}
For any line that begins with >, the commands in curly braces are executed. The first command prints > followed by all but the last 6 characters of the filename (6 being the length of the .fasta extension). The second command, next, skips the remaining commands and starts over with the next line of input.
1
This is awk's cryptic shorthand for print-the-line.
Let's consider a directory with two (identical) test files:
$ cat speciesA.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL
$ cat speciesB.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL
The output of our command is:
$ awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL
>speciesB
MJSUNDKFJSKFJSKFJ
>speciesB
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesB
KSDAFJLASDJFKLAJFL
The output has the substitutions and concatenates all the input files.
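The hard-coded 6 in the command above only works because .fasta happens to be six characters long. As a minimal sketch of a variation, the extension can be stripped by pattern instead. The scratch directory, the shortened sample files, and the names workdir, result and name are assumptions made for illustration, not part of the original answer.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
printf '>protein1 description\nMJSUNDKFJSKFJSKFJ\n' > "$workdir/speciesA.fasta"
printf '>protein1 description\nKSDAFJLASDJFKLAJFL\n' > "$workdir/speciesB.fasta"

# FNR == 1 is true on the first line of each input file, so the
# header name is computed once per file rather than once per header.
result=$(cd "$workdir" && awk '
  FNR == 1 { name = FILENAME; sub(/\.fasta$/, "", name) }
  /^>/     { print ">" name; next }
  1
' *.fasta)

printf '%s\n' "$result"
```

To save the concatenated result, redirect the output to a file whose name does not match *.fasta (or that lives in another directory), so that a second run does not pick it up as input.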
The shell only expands variables inside double quotes, so the sed script has to be double-quoted. Inside single quotes, $var is passed to sed as literal text.
for file in *.fasta
do
    # -i (GNU sed) edits each file in place; the replacement keeps the leading >
    sed -i "s/^>.*/>${file%%.*}/" "$file"
done
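For completeness, here is a sketch of how the loop from the question might be repaired so that it also concatenates, assuming two things went wrong there: the single quotes around the sed script, and the pipe after the assignment, which runs var=... in a subshell so the value is never set for sed. The scratch directory, the shortened sample files, and the output name combined.txt are hypothetical.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
printf '>protein1 description\nMJSUNDKFJSKFJSKFJ\n' > "$workdir/speciesA.fasta"
printf '>protein2 anothername\nKSDAFJLASDJFKLAJFL\n' > "$workdir/speciesB.fasta"

(
  cd "$workdir" || exit 1
  for file in *.fasta; do
    var=$(basename "$file" .fasta)   # plain assignment, no pipe
    sed "s/^>.*/>$var/" "$file"      # double quotes let the shell expand $var
  done
) > "$workdir/combined.txt"          # one redirect for the whole loop concatenates the output

cat "$workdir/combined.txt"
```

Note that this version still breaks if a file name contains characters that are special in a sed replacement, such as / or &, which is the case the awk answer avoids.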