Remove line breaks in a FASTA file

I have a FASTA file in which each sequence is wrapped across several lines. I'd like to remove the line breaks within each sequence. Here's an example of my file:

>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA

I'd like to convert it into this:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

I found a potential solution on this site, which looks like this:

cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta

However, this places an extra line break between entries, so the file looks like this:

>accession1
ATGGCCCATGGGATCCTAGC

>accession2
GATATCCATGAAACGGCTTA

I'm an awk noob, but I took a shot at modifying the command. My guess was that if (p){print "\n";} was the culprit: print "\n" is potentially adding two line breaks. I couldn't figure out how to add just one newline. This is probably something easy, but like I said, I'm a noob. Here was my (unsuccessful) attempt:

awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' input.fasta > joinedoutput.fasta

However, this adds an empty line at the beginning of the file, because it always prints a newline before each accession line, including the first:

{empty line} 
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Anyone have a solution to get my file in the correct format? Thanks!

asked Apr 06 '13 23:04 by chimeric


2 Answers

I would use sed for this. Using GNU sed:

sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file

Results:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

Create a label, a. If the current line is not the last line of the file, append the next line of input to the pattern space (N). If the pattern space does not start with the character >, perform the substitution s/\n\([^>]\)/\1/, which deletes an embedded newline when the character after it is not >, joining two sequence lines. If a substitution has succeeded since the last input line was read (or since the last t branch was taken), branch back to label a (ta). Otherwise, print the pattern space up to the first embedded newline (P). Then, if the pattern space contains no newline, start a normal new cycle as if the d command had been issued; otherwise, delete the text in the pattern space up to the first newline and restart the cycle with the remaining pattern space, without reading a new line of input (D).
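
The same script can also be laid out one command per line, which makes each step of the explanation easier to follow. This is only a sketch: it assumes GNU sed, as the answer does, and the function name unwrap_fasta_sed is made up for illustration.

```shell
# Hypothetical wrapper around the sed script above; reads the named files,
# or standard input when no file is given.
unwrap_fasta_sed() {
  sed '
    :a
    # Unless this is the last line, append the next input line.
    $!N
    # If the pattern space is not a header, join the next sequence line onto it.
    /^>/!s/\n\([^>]\)/\1/
    # A join succeeded, so loop and try to join another line.
    ta
    # Print the first line of the pattern space, then delete it and
    # restart the cycle without reading new input.
    P
    D
  ' "$@"
}
```

It could then be called as unwrap_fasta_sed input.fasta > joinedoutput.fasta.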

answered Sep 27 '22 15:09 by Steve


This awk program:

% awk '!/^>/ { printf "%s", $0; n = "\n" } 
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta

Will yield:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.

On lines that do start with a >, print the stored newline character (if any) followed by the line. Then reset n, so that END does not print a stray newline if this header turns out to be the last line.

End with a newline, if required.
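
The doubled line breaks in the question come from the same distinction this program relies on: print appends the output record separator (a newline by default) to whatever it is given, while printf emits exactly what it is given. As a rough sketch, the question's one-liner should also work once print "\n" is swapped for printf "\n". The function name fix_original is made up for illustration, and the sketch assumes every header is followed by at least one sequence line.

```shell
# print "\n" writes the string plus ORS, so two newlines come out;
# printf "\n" writes exactly one.
awk 'BEGIN { print "\n" }'  | wc -l   # counts 2 newlines
awk 'BEGIN { printf "\n" }' | wc -l   # counts 1 newline

# Hypothetical name: the one-liner from the question with printf in place
# of print wherever a bare newline is wanted.
fix_original() {
  awk '{
    if (substr($0, 1, 1) == ">") {
      if (p) printf "\n"      # close the previous sequence line
      print $0                # header, followed by ORS
    } else
      printf "%s", $0         # sequence chunk, no line break
    p++
  }
  END { printf "\n" }' "$@"
}
```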

Note:

By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in c and in most other traditional languages.

--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide
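
A quick way to check this behaviour is to print a variable that has never been assigned. In the sketch below, n is deliberately left unset; it shows up as an empty string with %s and as zero with %d.

```shell
# n is never assigned, so it is empty as a string and 0 as a number.
awk 'BEGIN { printf "[%s] %d\n", n, n }'   # prints: [] 0
```

This is why the program above can run print n $0 on the very first header without printing a leading blank line.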

answered Sep 27 '22 16:09 by Johnsyweb