I'm trying to find hard facts that will help my management understand how hard/easy it is to reverse-engineer compiled C code.
Similar questions have been asked before on this site (see e.g. Is it possible to “decompile” a Windows .exe? Or at least view the Assembly? or Possible to decompile DLL written in C?), but the gist of these questions is that decompiling compiled C code is "hard, but not entirely impossible".
In order to facilitate answers that are based in fact, I am including compiled code for a mystery function, and I propose that answers to this question measure the success or failure of the proposed techniques by whether they can determine what this function does. This may be unusual for SO but I think it's the best way to get "good subjective" or factual answers to this engineering question. Therefore, What is your best guess at what this function is doing, and how?
This is the compiled code, compiled on Mac OSX with gcc:
_mystery:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movsd LCPI1_0(%rip), %xmm1
subsd %xmm0, %xmm1
pxor %xmm2, %xmm2
ucomisd %xmm1, %xmm2
jbe LBB1_2
xorpd LCPI1_1(%rip), %xmm1
LBB1_2:
ucomisd LCPI1_2(%rip), %xmm1
jb LBB1_8
movsd LCPI1_0(%rip), %xmm1
movsd LCPI1_3(%rip), %xmm2
pxor %xmm3, %xmm3
movsd LCPI1_1(%rip), %xmm4
jmp LBB1_4
.align 4, 0x90
LBB1_5:
ucomisd LCPI1_2(%rip), %xmm1
jb LBB1_9
movapd %xmm5, %xmm1
LBB1_4:
movapd %xmm0, %xmm5
divsd %xmm1, %xmm5
addsd %xmm1, %xmm5
mulsd %xmm2, %xmm5
movapd %xmm5, %xmm1
mulsd %xmm1, %xmm1
subsd %xmm0, %xmm1
ucomisd %xmm1, %xmm3
jbe LBB1_5
xorpd %xmm4, %xmm1
jmp LBB1_5
LBB1_8:
movsd LCPI1_0(%rip), %xmm5
LBB1_9:
movapd %xmm5, %xmm0
popq %rbp
ret
Leh_func_end1:
UPDATE
@Igor Skochinsky is the first to find the right answer: it is indeed a naive implementation of Heron's algorithm for calculating square roots. The original source code is here:
#include <stdio.h>
#define EPS 1e-7
double mystery(double x){
double y=1.;
double diff;
diff=y*y-x;
diff=diff<0?-diff:diff;
while(diff>=EPS){
y=(y+x/y)/2.;
diff=y*y-x;
diff=diff<0?-diff:diff;
}
return y;
}
int main() {
printf("The square root of 2 is %g\n", mystery(2.));
}
Because C++ compilers generally do not put any more information into the executable than they absolutely have to (especially not if they are compiling in release mode rather than a debug build), so the information you'd need to accurately decompile the program simply is not present in the executable.
It is also not possible to decompile all programs. Furthermore, it is not easy to separate data and code because both are represented similarly in most current computer systems. A type of reverse engineering, a decompiler performs the opposite operations of a compiler.
Decompiling is both illegal and wrong, unless it's your own work. You can learn what you need on Google, or find open-source stuff using it and learn from that. It's illegal to decompile ANYTHING without permission.
Every application must be executed -- read and understood -- by something. Therefore, every application can be "decompiled".
Here's the results of decompilation with the Hex-Rays Decompiler after I converted code to x86 (it does not support x64 at the moment), added some data definitions missing in the original post, and assembled it:
//-------------------------------------------------------------------------
// Data declarations
double LCPI1_0 = 1.0; // weak
double LCPI1_1[2] = { 0.0, 0.0 }; // weak
double LCPI1_2 = 1.2; // weak
double LCPI1_3 = 1.3; // weak
//----- (00000000) --------------------------------------------------------
void __usercall mystery(__m128d a1<xmm0>)
{
__m128d v1; // xmm1@1
__m128d v2; // xmm1@4
__int128 v3; // xmm2@4
__m128d v4; // xmm5@7
__m128d v5; // xmm1@7
v1 = (__m128d)*(unsigned __int64 *)&LCPI1_0;
v1.m128d_f64[0] = LCPI1_0 - a1.m128d_f64[0];
if ( LCPI1_0 - a1.m128d_f64[0] < 0.0 )
v1 = _mm_xor_pd(v1, *(__m128d *)LCPI1_1);
if ( v1.m128d_f64[0] >= LCPI1_2 )
{
v2 = (__m128d)*(unsigned __int64 *)&LCPI1_0;
v3 = *(unsigned __int64 *)&LCPI1_3;
while ( 1 )
{
v4 = a1;
v4.m128d_f64[0] = (v4.m128d_f64[0] / v2.m128d_f64[0] + v2.m128d_f64[0]) * *(double *)&v3;
v5 = v4;
v5.m128d_f64[0] = v5.m128d_f64[0] * v5.m128d_f64[0] - a1.m128d_f64[0];
if ( v5.m128d_f64[0] < 0.0 )
v5 = _mm_xor_pd(a1, (__m128d)*(unsigned __int64 *)LCPI1_1);
if ( v5.m128d_f64[0] < LCPI1_2 )
break;
v2 = a1;
}
}
}
// 90: using guessed type double LCPI1_0;
// 98: using guessed type double LCPI1_1[2];
// A8: using guessed type double LCPI1_2;
// B0: using guessed type double LCPI1_3;
// ALL OK, 1 function(s) have been successfully decompiled
Clearly, it could use some improvement (XMM support is somewhat basic right now), but I think the basic algorithm is already understandable.
Edit: since it's apparent that only the low double of all XMM registers is used, it seems the function actually works with scalar doubles and not vectors. As for the _mm_xor_pd (xorpd) intrinsic, I think it's just the way the compiler implements sign inversion - by xoring with a predefined constant which has 1s in sign bit positions and 0s everywhere else. With the above in mind, and after some cleanup, I get the following code:
double mystery(double a1)
{
double v1; // xmm1@1
double v2; // xmm1@4
double v3; // xmm2@4
double v4; // xmm5@7
double v5; // xmm1@7
v1 = LCPI1_0 - a1;
if ( v1 < 0.0 )
v1 = -v1;
if ( v1 < LCPI1_2 )
{
v4 = LCPI1_0;
}
else
{
v2 = LCPI1_0;
v3 = LCPI1_3;
while ( 1 )
{
v4 = a1;
v4 = (v4 / v2 + v2) * v3;
v5 = v4;
v5 = v5 * v5 - a1;
if ( v5 < 0.0 )
v5 = -v5;
if ( v5 < LCPI1_2 )
break;
v2 = a1;
}
}
return v4;
}
It produces assembly pretty similar to the original post.
Reverse engineering / decompiling any code is a matter of the time it takes vs the benefit in doing so; not how hard it is to do.
If you have some secret sauce that you absolutely can't allow to get out, then the only thing you can do is have that secret sauce as a web service which gets called upon as necessary. This way the binaries never leave your corporate walls.
Even obfuscation only goes so far as anything can be traced once a hacker has the compiled binaries on a system they control. Heck, the original PC clones were created by reverse engineering the IBM BIOS.
So, back to the point: Again, it's not a question of how hard something is, it's more a question of whether anyone would want to try... which is based on what perceived value they would get out of it. Whether direct dollars (receiving or saving), competitive advantage or simply bragging rights. Compounding this is the availability of the application: wider distribution equals higher potential for finding it's way into a hackers bucket of things to work on.
If those values exist, then you can be assured that someone will try and they will succeed. Which should lead you to the next question: What if they do? What's the worst outcome?
In some cases it's simply a lost sale, that you may not have gotten anyway. In others it could be the loss of the business.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With