Dear stackoverflow users,
I have TAB sep data like this:
head -4 input.tsv
seq A C change
seq T A ok
seq C C change
seq AC CCT change
And I need create reverse complement function in awk which do something like this
head -4 output.tsv
seq T G change
seq T A ok
seq G G change
seq GT AGG change
So if 4th column is flagged "change" I need to create reverse complement sequence.
HINT - the same things doing for example tr in bash - Bash one liner for this task is:
echo "ACCGA" | rev | tr "ATGC" "TACG"
I was tried something like this
awk 'BEGIN {c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" }{OFS="\t"}
function revcomp( i, o) {
o = ""
for(i = length; i > 0; i--)
o = o c[substr($0, i, 1)]
return(o)
}
{
if($4 == "change"){$2 = revcom(); $3 = revcom()} print $0; else print $0}' input
Biological reverse sequence mean:
A => T
C => G
G => C
T => A
and reverse complement mean:
ACCATG => CATGGT
Edited: Also anybody just for education can share this solution in python.
With a little tinkering of your attempt you can do something like below.
function revcomp(arg) {
o = ""
for(i = length(arg); i > 0; i--)
o = o c[substr(arg, i, 1)]
return(o)
}
BEGIN {c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" ; OFS="\t"}
{
if($4 == "change") {
$2 = revcomp($2);
$3 = revcomp($3)
}
}1
The key here was to use the function revcomp to take the arguments as the column values and operate on it by iterating from end. You were previously doing on the whole line $0, i.e. substr($0, i, 1) which would be causing a lot of unusual lookups on the array c.
I've also taken the liberty of changing the prototype of your function revcomp to take the input string and return the reversed one. Because I wasn't sure how you were intending to use in your original attempt.
If you are intending to use the above in a part of a larger script, I would recommend putting the whole code as-above in a script file, set the she-bang interpreter to #!/usr/bin/awk -f and run the script as awk -f script.awk input.tsv
A crude bash version implemented in awk would like below. Note that, it is not clean and not a recommended approach. See more at AllAboutGetline
As before call the function as $2 = revcomp_bash($2) and $3 = revcomp_bash($3)
function revcomp_bash(arg) {
o = ""
cmd = "printf \"%s\" " arg "| rev | tr \"ATGC\" \"TACG\""
while ( ( cmd | getline o ) > 0 ) {
}
close(cmd);
return(o)
}
Your whole code speaks GNU awk-ism, so didn't care for converting it to a POSIX compliant one. You could use split() with a empty de-limiter instead of length() but the POSIX specification gladly says that "The effect of a null string as the value of fs is unspecified."
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With