AWK

Question

I am trying to alter a column/field within the 'header' lines of a DNA sequence file that is thousands of lines long. Specifically, I want to change the first field of each header (compX_seqy), which ALWAYS starts with ">":

An example of just the first two sequences:

 #cat example

 >comp0_seq1 444 [12:23]
 AGAGGACAC
 GATCCAACATA
 AGASCAC
 >comp0_seq2 333 [12:32:599:1]
 GTCGATC
 CYAACY
 CCCCA
 ...

I would like to add an "A" to the end of the first column only, for ALL lines starting with ">",

comp0_seq1A

Then print the rest of the line and then next lines (sequences) until the next ">" line is reached (and repeat).

I want the output to look like this :

>comp0_seq1A 444 [12:23]
AGAGGACAC
GATCCAACATA
AGASCAC
>comp0_seq2A 333 [12:32:599:1]
GTCGATC
CYAACY
CCCCA
...

I tried this first:

awk '$1=$1"A"' example

>comp0_seq1A 444 [12:23]
AGAGGACACA
GATCCAACATAA
AGASCACA
>comp0_seq2A 333 [12:32:599:1]
GTCGATCA
CYAACYA
CCCCAA
A
A

It adds an A to the first field of all lines, so not quite.

Then I tried this, using a regex to modify only lines starting with ">":

# awk '/^>/ {print $1=$1"A";getline;print $0}' example
>comp0_seq1A
AGAGGACAC
>comp0_seq2A
GTCGATC

But that only prints the first line AFTER the match. So, how do I print all/any lines AFTER the match/replacement, up until the next ">"? I tried to use 'next', but I guess I don't understand how to use it in this context.

Any advice? I know I am close and am banging my head on my keyboard.

Thx, LP.

ghoti · Accepted Answer

You've almost got it. You're just overthinking things with your getline.

In awk, the following should work:

$ awk '/^>/ {$1=$1"A"} 1' file.txt

This works by running the command in curly braces on all lines that match the regular expression ^>. The 1 at the end is awk shorthand for "print the current line": it is a pattern that is always true, and a pattern with no action defaults to printing the line.
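To make the behaviour concrete, here is a small sketch of the same one-liner written out in long form, plus a sub() variant. The file name example.fa and the extra spaces in the second header are made up for illustration. One thing worth knowing: assigning to $1 makes awk rebuild the line with single spaces between fields, which is harmless for the data in the question but does change headers that contain runs of spaces or tabs. The sub() form edits the text directly and leaves the spacing alone.

```shell
# Hypothetical sample file; the second header deliberately has extra spaces.
cat > example.fa <<'EOF'
>comp0_seq1 444 [12:23]
AGAGGACAC
>comp0_seq2  333   [12:32:599:1]
GTCGATC
EOF

# Long form of the one-liner: edit header lines, then print every line.
# Assigning to $1 rebuilds the record, so runs of spaces collapse to one.
awk '/^>/ { $1 = $1 "A" } { print }' example.fa

# Variant: append "A" to the first word with sub(); "&" is the matched text.
# The original spacing of the header is preserved.
awk '{ sub(/^>[^ ]*/, "&A") } 1' example.fa
```

With the sample above, the first command prints the second header as ">comp0_seq2A 333 [12:32:599:1]", while the second keeps it as ">comp0_seq2A  333   [12:32:599:1]".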

Another option for a substitution this simple would be to use sed:

$ sed '/^>/s/ /A /' file.txt

This works by searching for lines that match the same regex, then replacing the first space on those lines with the string "A " (an A followed by a space). sed will print each line by default, so no explicit print is required.
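One caveat with keying on the space, shown as a quick sketch: a header that consists of a single field has no space to replace, so it passes through unchanged. The header >comp0_seq3 below is invented for the demonstration; the headers in the question all have several fields, so this only matters if your real data differs.

```shell
# Two headers: one with several fields, one (hypothetical) with only a name.
printf '%s\n' '>comp0_seq1 444 [12:23]' 'AGAGGACAC' '>comp0_seq3' 'ACGT' |
    sed '/^>/s/ /A /'
# The first header becomes ">comp0_seq1A 444 [12:23]";
# ">comp0_seq3" is printed as-is because it contains no space.
```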

Or if you prefer something that substitutes the first "field" rather than the first "field separator", this can work:

$ sed 's/^\(>[^ ]*\)/\1A/' file.txt

By default, sed regexes are "BRE", so the grouping brackets need to be escaped. The \1 is a reference to the first (in this case "only") bracketed expression in the search regex.
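For comparison, here is a sketch of three equivalent ways to write the field-based substitution. The -E switch (extended regexes) is an assumption about your sed: GNU and BSD sed both accept it, but very old implementations may not. The third form avoids the group entirely by using &, which stands for the whole matched text.

```shell
header='>comp0_seq1 444 [12:23]'

# BRE (the default): grouping parentheses are escaped.
printf '%s\n' "$header" | sed 's/^\(>[^ ]*\)/\1A/'

# ERE via -E: same expression with unescaped parentheses.
printf '%s\n' "$header" | sed -E 's/^(>[^ ]*)/\1A/'

# No group at all: "&" refers to everything the pattern matched.
printf '%s\n' "$header" | sed 's/^>[^ ]*/&A/'
```

All three print ">comp0_seq1A 444 [12:23]". Because the pattern is anchored with ^>, sequence lines never match and are printed untouched.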

AWK - replace specific column on matching line, then print other lines

Tags:

sed

LP_640
