I have a file that contains thousands of accession numbers and looks like this:
>NC_033829.1 Kallithea virus isolate DrosEU46_Kharkiv_2014, complete genome
AGTCAGCAACGTCGATGTGGCGTACAATTTCTTGATTACATTTTTGTTCCTAACAAAATGTTGATATACT
>NC_020414.2 Escherichia phage UAB_Phi78, complete genome
TAGGCGTGTGTCAGGTCTCTCGGCCTCGGCCTCGCCGGGATGTCCCCATAGGGTGCCTGTGGGCGCTAGG
If I want to split this into multiple files with one accession number each, I can use the following code:
awk -F '[> ]' '/^>/ {F=sprintf("%s.fasta",$2); print > F; next} {print >> F}' < yourfile.fa
I have a file with thousands of accession numbers (i.e. >NC_* headers) and want to split it such that each file contains ~5000 accession numbers. Since I am new to awk/bash/python, I struggle to find a neat solution.
Any ideas or comments are appreciated.
It wasn't clear from your question that an "accession number" is unique per input block (don't assume the people reading your question know anything about your domain - it's all just lines of text to us). It would have been clearer if you had phrased your question to say you want 5000 blank-line-separated blocks per output file rather than 5000 accession numbers.
Having seen the answer you posted, it's now clear that this is what you should be using:
awk -v RS= -v ORS='\n\n' '
# RS= enables paragraph mode: each blank-line-separated block is one record
(NR%5000) == 1 { close(out); out="myseq"(++n_seq)".fa" }  # start a new output file every 5000 records
{ print > out }                                           # write the current record to the open file
' my_sequences.fa
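If the sequences in your file are not separated by blank lines (i.e. plain FASTA where each record is just a > header followed by sequence lines), a small variant that counts header lines instead of paragraph-mode records does the same job. This is only a sketch, assuming every record starts with a > line; the myseq*.fa naming follows the script above:
awk '
/^>/ && (++n_rec % 5000) == 1 { close(out); out = "myseq" (++n_seq) ".fa" }  # new file at records 1, 5001, 10001, ...
{ print > out }                                                              # header and sequence lines go to the current file
' my_sequences.fa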
Assumptions: sections are separated by empty lines.
Algorithm:
Awk terms: a "record" will be our section - a part of the file separated by an empty line (i.e. two newline characters one after another). A "field" is usually separated by spaces - by splitting on a space or the > character, the second field will be the accession number.
Just set the record separator to two newlines and the field separator to > or space, and then output each record to a file named after the second field:
awk -v RS='' -v FS='[> ]' '{f=($2 ".txt"); print >> f; close(f)}' < yourfile.fa
@edit: changed > to >> and RS='\n\n' to RS=''
@edit: also added close
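For example, assuming the two sample records from the question are separated by a blank line, running the command creates one file per accession number (the filenames here follow from the sample headers):
$ awk -v RS='' -v FS='[> ]' '{f=($2 ".txt"); print >> f; close(f)}' < yourfile.fa
$ ls
NC_020414.2.txt  NC_033829.1.txt  yourfile.fa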