I am trying to alter a field in the 'header' lines of a DNA sequence file that is thousands of lines long. Specifically, I want to change the first field of each header (compX_seqy), which ALWAYS starts with ">".
An example showing just the first two sequences:
# cat example
>comp0_seq1 444 [12:23]
AGAGGACAC
GATCCAACATA
AGASCAC
>comp0_seq2 333 [12:32:599:1]
GTCGATC
CYAACY
CCCCA
...
I would like to add an "A" to the end of the first column only, for ALL lines starting with ">", for example:
comp0_seq1A
Then print the rest of the line and then next lines (sequences) until the next ">" line is reached (and repeat).
I want the output to look like this:
>comp0_seq1A 444 [12:23]
AGAGGACAC
GATCCAACATA
AGASCAC
>comp0_seq2A 333 [12:32:599:1]
GTCGATC
CYAACY
CCCCA
...
I tried this first:
awk '$1=$1"A"' example
>comp0_seq1A 444 [12:23]
AGAGGACACA
GATCCAACATAA
AGASCACA
>comp0_seq2A 333 [12:32:599:1]
GTCGATCA
CYAACYA
CCCCAA
A
A
It adds an A to the first field of all lines, so not quite.
Then I tried this, using a regex to replace only lines starting with ">":
# awk '/^>/ {print $1=$1"A";getline;print $0}' example
>comp0_seq1A
AGAGGACAC
>comp0_seq2A
GTCGATC
But that only prints the first line AFTER the match. So how do I print all lines AFTER the match/replacement, up to the next ">"? I tried to use 'next', but I guess I don't understand how to use it in this context.
Any advice? I know I am close and am banging my head on my keyboard.
Thx, LP.
You've almost got it. You're just overthinking things with your getline. In awk, the following should work:
$ awk '/^>/ {$1=$1"A"} 1' file.txt
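This works by running the command in curly braces on all lines that match the regular expression ^>. The 1 at the end is awk shorthand for "print the current line": it is a pattern that is always true, and a pattern with no action triggers awk's default action, which is to print.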
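As a quick check of the pattern-action pair and the trailing 1, here is a minimal sketch that pipes a few lines from the question through the same awk program. The function name add_suffix and the inline printf input are illustrative assumptions, not part of the original answer.

```shell
# Hypothetical helper: append "A" to the first field of every ">" header line.
# Non-header lines match no action and are printed unchanged by the trailing 1.
add_suffix() {
  awk '/^>/ { $1 = $1 "A" } 1'
}

printf '%s\n' \
  '>comp0_seq1 444 [12:23]' \
  'AGAGGACAC' \
  '>comp0_seq2 333 [12:32:599:1]' \
  'GTCGATC' | add_suffix
# prints:
# >comp0_seq1A 444 [12:23]
# AGAGGACAC
# >comp0_seq2A 333 [12:32:599:1]
# GTCGATC
```

One thing to keep in mind: assigning to $1 makes awk rebuild the line with the output field separator (a single space by default), so runs of spaces or tabs in a header would be collapsed. For the space-separated headers shown in the question this makes no difference.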
Another option for a substitution this simple would be to use sed:
$ sed '/^>/s/ /A /' file.txt
This works by searching for lines that match the same regex, then replacing the first space with the string "A " (an A followed by a space). sed prints each line by default, so no explicit print is required.
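The sketch below runs that sed command on a small sample to show what "replacing the first space" means in practice. The function name suffix_first_space and the third header (>comp0_seq3, with no trailing fields) are made up for illustration; the question's data always has a space after the first field.

```shell
# Hypothetical helper wrapping the sed command above.
suffix_first_space() {
  sed '/^>/s/ /A /'
}

printf '%s\n' \
  '>comp0_seq1 444 [12:23]' \
  'AGAGGACAC' \
  '>comp0_seq3' | suffix_first_space
# prints:
# >comp0_seq1A 444 [12:23]
# AGAGGACAC
# >comp0_seq3
```

The last header comes out unchanged because it contains no space to replace. If your headers can consist of a single field, substituting on the field itself rather than on the separator avoids this.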
Or if you prefer something that substitutes the first "field" rather than the first "field separator", this can work:
$ sed 's/^\(>[^ ]*\)/\1A/' file.txt
By default, sed regexes are "BRE" (basic regular expressions), so the grouping parentheses need to be escaped. The \1 is a backreference to the first (in this case, the only) parenthesized group in the search regex.
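If escaping the parentheses feels noisy, the same substitution can be written with extended regular expressions. This is a sketch that assumes your sed supports the -E option (GNU and BSD sed both do); the function name suffix_first_field and the single-field header >comp0_seq3 are illustrative only.

```shell
# Hypothetical helper: same substitution as above, written as an ERE,
# so the capturing parentheses are not escaped.
suffix_first_field() {
  sed -E 's/^(>[^ ]*)/\1A/'
}

printf '%s\n' \
  '>comp0_seq1 444 [12:23]' \
  'AGAGGACAC' \
  '>comp0_seq3' | suffix_first_field
# prints:
# >comp0_seq1A 444 [12:23]
# AGAGGACAC
# >comp0_seq3A
```

Because the pattern anchors on ">" at the start of the line and captures everything up to the first space, it also handles a header that has no further fields.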