
Split a multi-FASTA file into files with the same number of accession numbers

I have a file that has thousands of accession numbers and looks like this:

>NC_033829.1 Kallithea virus isolate DrosEU46_Kharkiv_2014, complete genome
AGTCAGCAACGTCGATGTGGCGTACAATTTCTTGATTACATTTTTGTTCCTAACAAAATGTTGATATACT

>NC_020414.2 Escherichia phage UAB_Phi78, complete genome
TAGGCGTGTGTCAGGTCTCTCGGCCTCGGCCTCGCCGGGATGTCCCCATAGGGTGCCTGTGGGCGCTAGG

If I want to split this into multiple files with one accession number each, I can use the following code:

awk -F '|' '/^>/ {F=sprintf("%s.fasta",$2); print > F;next;} {print >> F;}' < yourfile.fa

I have a file with thousands of accession numbers (aka >NC_*) and want to split it so that each output file contains ~5000 accession numbers. Since I am new to awk/bash/python, I am struggling to find a neat solution.

Any ideas or comments are appreciated.

LDT asked Jul 25 '21 19:07



2 Answers

It wasn't clear from your question that an "accession number" is unique per input block (don't assume the people reading your question know anything about your domain - it's all just lines of text to us). It would have been clearer if you had phrased your question to just say you want 5000 new-line-separated blocks per output file rather than 5000 accession numbers.

Having seen the answer you posted, it's now clear that this is what you should be using:

awk -v RS= -v ORS='\n\n' '
    (NR%5000) == 1 { close(out); out="myseq"(++n_seq)".fa" }
    { print > out }
' my_sequences.fa
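
The command above relies on blank lines between records, as in the sample in the question. If the file has no blank lines between records, a possible variation is to count header lines (lines starting with >) instead. This is only a sketch under that assumption: the names sample.fa and chunkN.fa and the variable n are made up for illustration, and n would be 5000 for the question's case.

```shell
# Tiny made-up sample with no blank lines between records (file name is hypothetical).
printf '>A one\nACGT\nGGCC\n>B two\nTTTT\n>C three\nAAAA\n' > sample.fa

# n is the number of records per output file: 5000 in the question, 2 here for the demo.
awk -v n=2 '
    /^>/ && (count++ % n) == 0 {   # every n-th header starts a new output file
        if (out != "") close(out)
        out = "chunk" (++idx) ".fa"
    }
    { print > out }                # assumes the first line of the input is a header
' sample.fa
```

With the sample above, chunk1.fa should receive records A and B and chunk2.fa should receive record C. Multi-line sequences stay with their header because only header lines are counted.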
Ed Morton answered Oct 24 '22 22:10



Assumptions: sections are separated by empty lines.

Algorithm:

  • split the file into sections
  • extract the accession number from each section
  • write each section to a file named after its accession number.

Awk terms: a "record" will be our section, i.e. a part of the file delimited by an empty line (two newline characters one after another). A "field" is usually separated by spaces; by splitting on either a space or the > character, the second field will be the accession number.

Just set the record separator to the empty string (awk's paragraph mode, where records are separated by blank lines) and the field separator to > or space, and then write each record to a file named after the second field:

awk -v RS='' -v FS='[> ]' '{f=($2 ".txt"); print >> f; close(f)}'

@edit changed > to >> and RS='\n\n' to RS=''

@edit and also added close
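
To check what the record and field settings above actually pick out, one option is to print the second field of every record before writing any files. A small sketch, where demo.fa and the accession-like names in it are made up for illustration:

```shell
# Made-up two-record sample, records separated by an empty line (file name is hypothetical).
printf '>NC_1.1 virus A\nACGT\n\n>NC_2.2 phage B\nTTGG\n' > demo.fa

# RS='' selects paragraph mode; FS='[> ]' splits on ">" or a space,
# so $1 is the empty text before ">" and $2 is the accession number.
awk -v RS='' -v FS='[> ]' '{ print NR ": " $2 }' demo.fa
```

For this sample the command should print "1: NC_1.1" and "2: NC_2.2", one per line.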

KamilCuk answered Oct 24 '22 23:10
