extract sequences from multifasta file by ID in file using awk

Question

I would like to extract sequences from the multifasta file that match the IDs given by separate list of IDs.

FASTA file seq.fasta:

>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT

IDs file id.txt:

7P58X:01332:11636
7P58X:01334:11613

I want to get the fasta file with only those sequences matching the IDs in the id.txt file:

>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC

I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:

(1)

awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta

this code works well for the multiline sequences but IDs have to be inserted separately to the code.

(2)

awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta

this code can take the IDs from the id.txt file but returns only the first line of the multiline sequences.

I guess that the good thing would be to modify the RS variable in the code (2) but all of my attempts failed so far. Can, please, anybody help me with that?

Ed Morton · Accepted Answer

$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC

RavinderSingh13 · Answer

Following awk may help you on same.

awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt  fasta_file

extract sequences from multifasta file by ID in file using awk

Tags:

search

multiline

awk

bioinformatics

fasta

Dalibor Miklík

2 Answers

Ed Morton

RavinderSingh13

Recent Activity

Donate For Us

extract sequences from multifasta file by ID in file using awk

Tags:

search

multiline

awk

bioinformatics

fasta

Dalibor Miklík

2 Answers

Ed Morton

RavinderSingh13

Related questions

Recent Activity

Donate For Us