Converting FASTQ to FASTA with SED/AWK

Question

I have a data in that always comes in block of four in the following format (called FASTQ):

@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

Is there a simple sed/awk/bash way to convert them into this format (called FASTA):

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

In principle, we want to extract the first two lines in each block-of-4 and replace @ with >.

Owen · Accepted Answer

This is an old question, and there have been many different solutions offered. Since the accepted answer uses sed but has a glaring problem (which is that it will replace @ with > when the @ sign appears as the first letter of the quality line), I feel compelled to offer a simple sed-based solution that actually works:

sed -n '1~4s/^@/>/p;2~4p'

The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.

The fastq_to_fasta script in the fastx toolkit also works. (It's worth mentioning that you need to specify the -Q33 option to accommodate the now common Phred+33 qual encodings. Which is funny, since it's throwing away the quality data anyway!)

Mark Edgar · Answer

sed ain't dead. If we're golfing:

sed '/^@/!d;s//>/;N'

Or, emulating http://www.ringtail.tsl.ac.uk/david-studholme/scripts/fastq2fasta.pl posted by Pierre, which only prints the first word (the id) from the first line and does (some) error handling:

#!/usr/bin/sed -f
# Read a total of four lines
$b error
N;$b error
N;$b error
N
# Parse the lines
/^@$\([^ ]*$.*\)$\n[ACGTN]*$\n+\1\n.*$/{
  # Output id and sequence for FASTA format.
  s//>\2\3/
  b
}
:error
i\
Error parsing input:
q

There seem to be plenty of existing tools for converting these formats; you should probably use these instead of anything posted here (including the above).

C. Bergman · Answer

As detailed in Cock, et al (2009) NAR, many of these solutions are incorrect since "the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence."

See http://ukpmc.ac.uk/articlerender.cgi?accid=PMC2847217 for details.

ghostdog74 · Answer

just awk , no need other tools

# awk '/^@SR/{gsub(/^@/,">",$1);print;getline;print}' file
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Converting FASTQ to FASTA with SED/AWK

Tags:

shell

sed

awk

fasta

fastq

neversaint

4 Answers

Owen

Mark Edgar

C. Bergman

ghostdog74

Recent Activity

Donate For Us

Converting FASTQ to FASTA with SED/AWK

Tags:

shell

sed

awk

fasta

fastq

neversaint

4 Answers

Owen

Mark Edgar

C. Bergman

ghostdog74

Related questions

Recent Activity

Donate For Us