
How to properly deobfuscate a Perl script?

I'm trying to deobfuscate the following Perl code (source):

#!/usr/bin/perl
(my$d=q[AA                GTCAGTTCCT
  CGCTATGTA                 ACACACACCA
    TTTGTGAGT                ATGTAACATA
      CTCGCTGGC              TATGTCAGAC
        AGATTGATC          GATCGATAGA
          ATGATAGATC     GAACGAGTGA
            TAGATAGAGT GATAGATAGA
              GAGAGA GATAGAACGA
                TC GATAGAGAGA
                 TAGATAGACA G
               ATCGAGAGAC AGATA
             GAACGACAGA TAGATAGAT
           TGAGTGATAG    ACTGAGAGAT
         AGATAGATTG        ATAGATAGAT
       AGATAGATAG           ACTGATAGAT
     AGAGTGATAG             ATAGAATGAG
   AGATAGACAG               ACAGACAGAT
  AGATAGACAG               AGAGACAGAT
  TGATAGATAG             ATAGATAGAT
  TGATAGATAG           AATGATAGAT
   AGATTGAGTG        ACAGATCGAT
     AGAACCTTTCT   CAGTAACAGT
       CTTTCTCGC TGGCTTGCTT
         TCTAA CAACCTTACT
           G ACTGCCTTTC
           TGAGATAGAT CGA
         TAGATAGATA GACAGAC
       AGATAGATAG  ATAGAATGAC
     AGACAGAGAG      ACAGAATGAT
   CGAGAGACAG          ATAGATAGAT
  AGAATGATAG             ACAGATAGAC
  AGATAGATAG               ACAGACAGAT
  AGACAGACTG                 ATAGATAGAT
   AGATAGATAG                 AATGACAGAT
     CGATTGAATG               ACAGATAGAT
       CGACAGATAG             ATAGACAGAT
         AGAGTGATAG          ATTGATCGAC
           TGATTGATAG      ACTGATTGAT
             AGACAGATAG  AGTGACAGAT
               CGACAGA TAGATAGATA
                 GATA GATAGATAG
                    ATAGACAGA G
                  AGATAGATAG ACA
                GTCGCAAGTTC GCTCACA
])=~s/\s+//g;%a=map{chr $_=>$i++}65,84,67,
71;$p=join$;,keys%a;while($d=~/([$p]{4})/g
){next if$j++%96>=16;$c=0;for$d(0..3){$c+=
$a{substr($1,$d,1)}*(4**$d)}$perl.=chr $c}
             eval $perl;

When run, it prints out Just another genome hacker.

After running the code through Deparse and perltidy (perl -MO=Deparse jagh.pl | perltidy), the code looks like this:

( my $d =
"AA...GCTCACA\n" # snipped double helix part
) =~ s/\s+//g;
(%a) = map( { chr $_, $i++; } 65, 84, 67, 71 );
$p = join( $;, keys %a );
while ( $d =~ /([$p]{4})/g ) {
    next if $j++ % 96 >= 16;
    $c = 0;
    foreach $d ( 0 .. 3 ) {
        $c += $a{ substr $1, $d, 1 } * 4**$d;
    }
    $perl .= chr $c;
}

Here's what I've been able to decipher on my own.

( my $d =
"AA...GCTCACA\n" # snipped double helix part
) =~ s/\s+//g;

removes all whitespace in $d (the double helix).

(%a) = map( { chr $_, $i++; } 65, 84, 67, 71 );

makes a hash with the keys A, T, C and G and the values 0, 1, 2 and 3. I normally code in Python, so this translates to a dictionary {'A': 0, 'T': 1, 'C': 2, 'G': 3} in Python.

$p = join( $;, keys %a );

joins the keys of the hash with $;, the subscript separator for multidimensional array emulation. The documentation says that the default is "\034", the same as SUBSEP in awk, but when I do:

my @ascii = unpack("C*", $p);
print $ascii[1];

I get the value 28? Also, it is not clear to me how this emulates a multidimensional array. Is $p now something like [['A'], ['T'], ['C'], ['G']] in Python?

    while ( $d =~ /([$p]{4})/g ) {

As long as $d matches ([$p]{4}), execute the code in the while block. But since I don't completely understand what structure $p is, I also have a hard time understanding what happens here.

next if $j++ % 96 >= 16;

Skip to the next iteration if $j modulo 96 is greater than or equal to 16. $j increments with each pass of the while loop (?).

$c = 0;
foreach $d ( 0 .. 3 ) {
    $c += $a{ substr $1, $d, 1 } * 4**$d;
}

For $d in the range from 0 to 3, extract some substring, but at this point I'm completely lost. The last few lines concatenate everything and evaluate the result.

BioGeek asked Feb 18 '12 14:02


1 Answer

Caution: don't blindly run obfuscated Perl, especially if there's an eval, backticks, system, open, etc. call somewhere in it, which might not be all that obvious*. De-obfuscating it with Deparse and carefully replacing the evals with print statements is a must until you understand what's going on. Running it in a sandbox, as an unprivileged user, or in a VM should be considered too.

*s&&$_&ee evaluates $_, for instance.


First observation: 034 is octal. It's equal to 28 (dec) or 0x1c (hex), so nothing fishy there.
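To check this, and to see what "multidimensional array emulation" means in practice, here is a small sketch. The variable names (%grid and so on) are mine, purely for illustration. It relies on the documented behaviour that a comma-separated list inside a hash subscript is joined with $; into one ordinary string key, so nothing is actually nested.

```perl
use strict;
use warnings;

# "\034" is an octal escape: 034 (octal) == 28 (decimal) == 0x1c (hex).
my $octal_value = oct('034');
my $subsep_code = ord($;);      # 28 unless $; has been reassigned

# Multidimensional emulation: the list in the subscript is joined
# with $; to build a single flat string key.
my %grid;
$grid{1, 2} = 'x';
my ($only_key) = keys %grid;
my $same_key   = join($;, 1, 2);

printf "%d %d %s\n", $octal_value, $subsep_code,
    ($only_key eq $same_key ? 'same key' : 'different key');
```

So $p is not a nested structure like [['A'], ['T'], ['C'], ['G']]; it is a flat string with chr(28) between the letters.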

The $; thing is pure obfuscation; I can't find a reason to use that separator in particular. $p will just be a string like A.T.C.G (with each . replaced by $;, whatever it is, and with the letters in whatever order keys returns them).
So in the regex, [$p] matches any of {'A', 'T', 'C', 'G', $;}. Since $; never appears in $d, it's useless there. In turn, [$p]{4} matches any sequence of four characters from the above set, as if this had been used (ignoring the useless $;):

while ( $d =~ /([ATCG]{4})/g ) { ... }

If you had to write this yourself, after having removed whitespace, you'd just grab each successive substring of $d of length four (assuming there are no other chars in $d).
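A minimal sketch of that idea, assuming the whitespace is already gone and the string only contains A, T, C and G. The helper name groups_of_four is made up for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: split a whitespace-free string into consecutive
# four-letter groups, like /([ATCG]{4})/g does in the deparsed loop.
sub groups_of_four {
    my ($dna) = @_;
    return $dna =~ /([ATCG]{4})/g;    # list context: every capture
}

my @groups = groups_of_four('CAATTCCTGGCT');
print "@groups\n";    # CAAT TCCT GGCT
```

A trailing run of fewer than four letters is simply ignored, which matches what the original while loop does.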

Now this part is fun:

foreach $d ( 0 .. 3 ) {
    $c += $a{ substr $1, $d, 1 } * 4**$d;
}
  • $1 holds the current four-letter codepoint. substr $1, $d, 1 returns each successive letter from that codepoint.
  • %a maps A to 00b (binary), T to 01b, C to 10b, and G to 11b.

    A   00
    T   01
    C   10
    G   11
    
  • multiplying by 4**$d will be equivalent to a bitwise left shift of 0, 2, 4 and 6.

So this funny construct allows you to build any 8-bit value in the base-four system with ATCG as digits!

i.e. it does the following conversions (the first letter of each group is the least significant digit, so the letters shown above the bits read in reverse order):

         A A A A
AAAA -> 00000000

         T A A T
TAAT -> 01000001 -> capital A in ASCII

         T A A C
CAAT -> 01000010 -> capital B in ASCII

CAATTCCTGGCTGTATTTCTTTCTGCCT -> BioGeek
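The conversions above can be reproduced with a short sketch. The names %digit, group_to_byte and dna_to_text are invented for this example; the arithmetic is the same as in the deparsed loop, written with shifts instead of 4**$d:

```perl
use strict;
use warnings;

my %digit = (A => 0, T => 1, C => 2, G => 3);

# Hypothetical helper: turn one four-letter group into its byte value.
# The first letter is the least significant base-4 digit.
sub group_to_byte {
    my ($group) = @_;
    my $value = 0;
    for my $pos (0 .. 3) {
        $value += $digit{ substr($group, $pos, 1) } << (2 * $pos);
    }
    return $value;
}

# Hypothetical helper: decode every four-letter group of a string.
sub dna_to_text {
    my ($dna) = @_;
    return join '', map { chr(group_to_byte($_)) } $dna =~ /([ATCG]{4})/g;
}

print dna_to_text('CAATTCCTGGCTGTATTTCTTTCTGCCT'), "\n";    # BioGeek
```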

This part:

next if $j++ % 96 >= 16;

makes the above conversion run only for the first 16 "codepoints", skips the next 80, then converts the next 16, skips the next 80, etc. It essentially just skips parts of the helix (junk DNA removal system).
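Putting the pieces together, here is a sketch of a decoder that keeps the skipping logic but returns the hidden source instead of eval'ing it, in line with the caution at the top. decode_helix and the toy input are assumptions made for illustration; the toy string is not the real helix.

```perl
use strict;
use warnings;

my %digit = (A => 0, T => 1, C => 2, G => 3);

# Hypothetical helper: decode a helix-style string the way the original
# loop does, but return the hidden source instead of running it.
sub decode_helix {
    my ($dna) = @_;
    $dna =~ s/\s+//g;                     # drop the layout whitespace
    my ($index, $source) = (0, '');
    while ($dna =~ /([ATCG]{4})/g) {
        my $group = $1;
        next if $index++ % 96 >= 16;      # keep groups 0..15 of every 96
        my $value = 0;
        $value += $digit{ substr($group, $_, 1) } * 4**$_ for 0 .. 3;
        $source .= chr $value;
    }
    return $source;
}

# Toy input: 16 real groups, 80 junk groups, then 2 more real groups.
my $toy = ('CAAT' x 16) . ('GGGG' x 80) . ('TAAT' x 2);
print decode_helix($toy), "\n";
```

The 80 GGGG groups never reach the output, so the result is sixteen B characters followed by two A characters.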


Here's an ugly text to DNA converter that you could use to produce anything to replace the helix (doesn't handle the 80 skip thing):

use strict;
use warnings;

# Text to encode, taken from the first command-line argument.
my $in = shift;

# Base-4 digit to letter, the inverse of %a in the decoder.
my %conv = ( 0 => 'A', 1 => 'T', 2 => 'C', 3 => 'G');

for (my $i=0; $i<length($in); $i++) {
    my $chr = substr($in, $i, 1);
    my $chv = ord($chr);
    my $encoded ="";
    # Emit the four 2-bit digits, least significant first.
    $encoded .= $conv{($chv >> 0) & 0x3};
    $encoded .= $conv{($chv >> 2) & 0x3};
    $encoded .= $conv{($chv >> 4) & 0x3};
    $encoded .= $conv{($chv >> 6) & 0x3};
    print $encoded;
}
print "\n";

$ perl q.pl 'print "BioGeek\n";'
AAGTCAGTTCCTCGCTATGTAACACACACAATTCCTGGCTGTATTTCTTTCTGCCTAGTTCGCTCACAGCGA

Put that in $d instead of the helix (and remove the skipping part in the decoder).
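If you would rather leave the decoder untouched, a converter could pad the output so that the skipped groups are filler. The following is only a sketch under that assumption: char_to_group and encode_with_filler are hypothetical names, and AAAA is an arbitrary choice of filler, since skipped groups are never decoded.

```perl
use strict;
use warnings;

my @letter = qw(A T C G);

# Hypothetical helper: encode one character as four letters,
# least significant base-4 digit first.
sub char_to_group {
    my ($char) = @_;
    my $code = ord $char;
    return join '', map { $letter[ ($code >> (2 * $_)) & 3 ] } 0 .. 3;
}

# Hypothetical helper: emit 16 real groups, then 80 filler groups, and
# repeat, so that "next if $j++ % 96 >= 16" skips exactly the filler.
sub encode_with_filler {
    my ($text) = @_;
    my @groups = map { char_to_group($_) } split //, $text;
    my $dna = '';
    while (@groups) {
        $dna .= join '', splice(@groups, 0, 16);
        $dna .= 'AAAA' x 80 if @groups;    # filler only between real runs
    }
    return $dna;
}

print encode_with_filler('print "BioGeek\n";'), "\n";
```

For inputs of 16 characters or fewer, no filler is added and the result is the same as with the simple converter. The output is one long line; laying it out as a helix is left to the s/\s+//g in the decoder, which makes any whitespace harmless.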

Mat answered Nov 11 '22 12:11