Split a multifasta file into files with the same number of accession numbers

Question

I have a file that contains thousands of accession numbers and looks like this:

>NC_033829.1 Kallithea virus isolate DrosEU46_Kharkiv_2014, complete genome
AGTCAGCAACGTCGATGTGGCGTACAATTTCTTGATTACATTTTTGTTCCTAACAAAATGTTGATATACT

>NC_020414.2 Escherichia phage UAB_Phi78, complete genome
TAGGCGTGTGTCAGGTCTCTCGGCCTCGGCCTCGCCGGGATGTCCCCATAGGGTGCCTGTGGGCGCTAGG

If I want to split this into multiple files with one accession number each, I can use the following code:

awk -F '|' '/^>/ {F=sprintf("%s.fasta",$2); print > F;next;} {print >> F;}' < yourfile.fa

I have a file with thousands of accession numbers (i.e. header lines starting with >NC_*) and want to split it so that each output file contains ~5000 accession numbers. Since I am new to awk/bash/python, I am struggling to find a neat solution.

Any ideas or comments are appreciated.

Ed Morton · Accepted Answer

It wasn't clear from your question that an "accession number" is unique per input block (don't assume the people reading your question know anything about your domain - it's all just lines of text to us). It would have been clearer if you had phrased your question to just say you want 5000 blank-line-separated blocks per output file rather than 5000 accession numbers.

Having seen the answer you posted, it's now clear that this is what you should be using:

awk -v RS= -v ORS='\n\n' '
    (NR%5000) == 1 { close(out); out="myseq"(++n_seq)".fa" }
    { print > out }
' my_sequences.fa
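
Since the question is also tagged python, here is a minimal Python sketch of the same idea, under one stated assumption that differs from the awk version: a record is taken to start at each line beginning with >, so blank lines between records are not required (and are dropped on output). The function name split_fasta and its parameters are hypothetical, chosen for illustration; the output names mirror the myseq1.fa, myseq2.fa, ... scheme above.

```python
from pathlib import Path


def split_fasta(path, records_per_file=5000, prefix="myseq"):
    """Write every `records_per_file` FASTA records to a new file.

    Assumption: a record starts at a line beginning with '>'.
    Blank lines are skipped, so they are not needed as separators.
    Output files are created next to the input file.
    Returns the list of output paths.
    """
    out_dir = Path(path).parent
    outputs = []
    out = None
    n_records = 0
    with open(path) as src:
        for line in src:
            if not line.strip():
                continue  # skip blank separator lines
            if line.startswith(">"):
                # Start a new output file at records 1, N+1, 2N+1, ...
                if n_records % records_per_file == 0:
                    if out is not None:
                        out.close()
                    out_path = out_dir / f"{prefix}{len(outputs) + 1}.fa"
                    outputs.append(out_path)
                    out = open(out_path, "w")
                n_records += 1
            if out is not None:
                out.write(line)  # lines before the first header are ignored
    if out is not None:
        out.close()
    return outputs
```

Calling split_fasta("my_sequences.fa") would write myseq1.fa, myseq2.fa, and so on next to the input, with the last file holding whatever records remain. Only one output file is open at a time, which is the same reason the awk script calls close(out).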

KamilCuk · Answer

Assumptions: sections are separated by empty lines.

Algorithm:

split the file into sections
extract the accession number from each section
write each section to a file named after its accession number

Awk terms: a "record" will be our section, i.e. a part of the file delimited by an empty line (two newline characters one after another). A "field" is usually separated by spaces; by splitting on either a space or the > character, the second field will be the accession number (the first field is the empty string before the leading >).

Just set the record separator to the empty string (awk's paragraph mode, which splits records on blank lines) and the field separator to > or space, and then write each record to a file named after the second field:

awk -v RS='' -v FS='[> ]' '{f=($2 ".txt"); print >> f; close(f)}'

@edit changed > to >> and RS=' ' to RS=''

@edit and also added close
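
The three-step algorithm in this answer can also be sketched in Python, for readers who find that easier to follow than awk. This is only an illustration under the same assumption (sections separated by empty lines, each starting with a header like >NC_033829.1 ...); the function name split_by_accession and its parameters are hypothetical. It reads the whole file into memory, which is a simplification that may not suit very large inputs.

```python
import re
from pathlib import Path


def split_by_accession(path, out_dir=".", suffix=".txt"):
    """Write each blank-line-separated section to <accession><suffix>.

    Assumptions: sections are separated by empty lines and each one
    starts with a header line such as '>NC_033829.1 description'.
    Returns the accessions in the order they were written.
    """
    text = Path(path).read_text()
    accessions = []
    # Step 1: split the file into sections on blank lines.
    for section in re.split(r"\n\s*\n", text.strip()):
        header = section.splitlines()[0]
        if not header.startswith(">"):
            continue  # not a FASTA header, skip this section
        # Step 2: the accession is the first token after '>'.
        parts = header[1:].split()
        if not parts:
            continue  # header with no accession, skip
        accession = parts[0]
        # Step 3: append the section to a file named after the accession
        # (append mode mirrors the '>>' used in the awk command).
        with open(Path(out_dir) / f"{accession}{suffix}", "a") as out:
            out.write(section + "\n")
        accessions.append(accession)
    return accessions
```

Because the files are opened in append mode, running it twice on the same input into the same directory would duplicate each section, just as the awk command with >> would.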
