Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create reverse complement sequence based on AWK

Tags:

python

awk

Dear stackoverflow users,

I have TAB sep data like this:

head -4 input.tsv

seq A      C change
seq T      A ok
seq C      C change
seq AC   CCT change

And I need create reverse complement function in awk which do something like this

head -4 output.tsv

seq T      G change
seq T      A ok
seq G      G change
seq GT   AGG change

So if 4th column is flagged "change" I need to create reverse complement sequence.

HINT - the same things doing for example tr in bash - Bash one liner for this task is:

echo "ACCGA" | rev | tr "ATGC" "TACG"

I was tried something like this

awk 'BEGIN {c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" }{OFS="\t"}
function revcomp(   i, o) {
    o = ""
    for(i = length; i > 0; i--)
        o = o c[substr($0, i, 1)]
    return(o)
}
{

if($4 == "change"){$2 = revcom(); $3 = revcom()} print $0; else print $0}' input

Biological reverse sequence mean:

A => T
C => G
G => C
T => A

and reverse complement mean:

ACCATG => CATGGT

Edited: Also anybody just for education can share this solution in python.

like image 244
Geroge Avatar asked May 21 '26 04:05

Geroge


1 Answers

With a little tinkering of your attempt you can do something like below.

function revcomp(arg) {
    o = ""
    for(i = length(arg); i > 0; i--)
        o = o c[substr(arg, i, 1)]
    return(o)
}

BEGIN {c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" ; OFS="\t"}

{
    if($4 == "change") {
        $2 = revcomp($2); 
        $3 = revcomp($3)
    } 
}1

The key here was to use the function revcomp to take the arguments as the column values and operate on it by iterating from end. You were previously doing on the whole line $0, i.e. substr($0, i, 1) which would be causing a lot of unusual lookups on the array c.

I've also taken the liberty of changing the prototype of your function revcomp to take the input string and return the reversed one. Because I wasn't sure how you were intending to use in your original attempt.

If you are intending to use the above in a part of a larger script, I would recommend putting the whole code as-above in a script file, set the she-bang interpreter to #!/usr/bin/awk -f and run the script as awk -f script.awk input.tsv

A crude bash version implemented in awk would like below. Note that, it is not clean and not a recommended approach. See more at AllAboutGetline

As before call the function as $2 = revcomp_bash($2) and $3 = revcomp_bash($3)

function revcomp_bash(arg) {
    o = ""
    cmd = "printf \"%s\" " arg "| rev | tr \"ATGC\" \"TACG\""
    while ( ( cmd | getline o ) > 0 ) { 

    }
    close(cmd);
    return(o)
}

Your whole code speaks GNU awk-ism, so didn't care for converting it to a POSIX compliant one. You could use split() with a empty de-limiter instead of length() but the POSIX specification gladly says that "The effect of a null string as the value of fs is unspecified."

like image 194
Inian Avatar answered May 23 '26 19:05

Inian