Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reverse engineering C-source code from assembly

I would like to know if anyone can help me out with a problem I am having when studying one of the lecture slides from an introductory assembly class that I am taking in school. The problem I am having is not understanding the assembly, it is how exactly the C source code is ordered based on the assembly. I will post the snippet I am talking about and maybe it will be clearer what I am talking about.

C Source given:

int arith(int x, int y, int z)
{ 
   int t1 = x+y;
   int t2 = z+t1;
   int t3 = x+4;
   int t4 = y * 48; 
   int t5 = t3 + t4;
   int rval = t2 * t5;
   return rval;
}

Assembly given:

arith:
pushl %ebp
movl %esp,%ebp

movl 8(%ebp),%eax
movl 12(%ebp),%edx
leal (%edx,%eax),%ecx
leal (%edx,%edx,2),%edx
sall $4,%edx
addl 16(%ebp),%ecx
leal 4(%edx,%eax),%eax
imull %ecx,%eax

movl %ebp,%esp
popl %ebp
ret

I am just confused as to how I am supposed to be able to discern for example that the adding of z + t1 (z + x + y) is listed on the second line(in the source) when in the assembly it comes after the y * 48 in the assembly code or for example that x + 4 is the 3rd line when in the assembly it is not even in a line by itself, its sort of mixed in with the last leal statement. It makes sense to me when I have the source but I am supposed to be able to reproduce the source for a test and I do understand that the compiler optimizes things but if anyone has a way of thinking about the reverse engineering that could help me out I would greatly appreciate it if they could walk me through their thought process.

Thanks.

like image 791
cpowel2 Avatar asked Nov 13 '11 18:11

cpowel2


People also ask

Can you reverse engineer compiled C code?

The Reverser allows you to reverse engineer compilable C code to a model, which you may want to do for the following reasons: To view the structure of the C code in Modeler. To develop the C code further in Modeler before regenerating the code. To move the C code to another platform, such as C++ or Java.

How do I decompile assembly code?

To do this, go to the Modules window and from the context menu of a . NET assembly, and then select the Decompile source code command. Visual Studio generates a symbol file for the assembly and then embeds the source into the symbol file. In a later step, you can extract the embedded source code.

How is assembly language used in reverse engineering?

Assembly programming for the reverse engineer is about learning how to write assembly. On top of this, it's also learning how the computer works in order to understand generated blocks of code and how the operating system deals with the user and the machine.

Is reverse engineering source code legal?

Done correctly, there is nothing wrong with reverse engineering, and it is not considered an “improper means” of gathering information, as defined by the Defend Trade Secrets Act (DTSA).


4 Answers

I've broken down the disassembly for you to show how the assembly was produced from the C source.

8(%ebp) = x, 12(%ebp) = y, 16(%ebp) = z

arith:

Create the stack frame:

pushl %ebp
movl %esp,%ebp


Move x into eax, y into edx:
movl 8(%ebp),%eax
movl 12(%ebp),%edx


t1 = x + y. leal (Load effective address) will add edx and eax, and t1 will be in ecx:
leal (%edx,%eax),%ecx


int t4 = y * 48; in two steps below, multiply by 3, then by 16. t4 will eventually be in edx:

Multiply edx by 2, and add edx to the result, ie. edx = edx * 3:

leal (%edx,%edx,2),%edx

Shift left 4 bits, ie. multiply by 16:

sall $4,%edx


int t2 = z+t1;. ecx initially holds t1, z is at 16(%ebp), at the end of the instruction ecx will be holding t2:
addl 16(%ebp),%ecx


int t5 = t3 + t4;. t3 was simply x + 4, and rather than calculating and storing t3, the expression of t3 is placed inline. This instruction essential does (x+4) + t4, which is the same as t3 + t4. It adds edx (t4) and eax (x), and adds 4 as an offset to achieve that result.
leal 4(%edx,%eax),%eax

int rval = t2 * t5; Fairly straight-forward this one; ecx represents t2 and eax represents t5. The return value is passed back to the caller through eax.

imull %ecx,%eax


Destroy the stack frame and restore esp and ebp:
movl %ebp,%esp
popl %ebp


Return from the routine:
ret


From this example you can see that the result is the same, but the structure is a bit different. Most likely this code was compiled with some sort of optimization or someone wrote it themself like that to demonstrate a point.

As others have said, you can't go exactly back to the source from the disassembly. It's up to the interpretation of the person reading the assembly to come up with equivalent C code.


To help with learning assembly and understanding the disassembly of your C programs, you can do the following on Linux:

Compile with debug information (-g), which will embed the source:

gcc -c -g arith.c

If you're on a 64-bit machine, you can tell the compiler to create a 32-bit binary with the -m32 flag (I did so for the example below).


Use objdump to dump the object file with it's source interleaved:

objdump -d -S arith.o

-d = disassembly, -S = display source. You can add -M intel-mnemonic to use the Intel ASM syntax if you prefer that over the AT&T syntax that your example uses.

Output:

arith.o:     file format elf32-i386


Disassembly of section .text:

00000000 <arith>:
int arith(int x, int y, int z)
{ 
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 20                sub    $0x20,%esp
   int t1 = x+y;
   6:   8b 45 0c                mov    0xc(%ebp),%eax
   9:   8b 55 08                mov    0x8(%ebp),%edx
   c:   01 d0                   add    %edx,%eax
   e:   89 45 fc                mov    %eax,-0x4(%ebp)
   int t2 = z+t1;
  11:   8b 45 fc                mov    -0x4(%ebp),%eax
  14:   8b 55 10                mov    0x10(%ebp),%edx
  17:   01 d0                   add    %edx,%eax
  19:   89 45 f8                mov    %eax,-0x8(%ebp)
   int t3 = x+4;
  1c:   8b 45 08                mov    0x8(%ebp),%eax
  1f:   83 c0 04                add    $0x4,%eax
  22:   89 45 f4                mov    %eax,-0xc(%ebp)
   int t4 = y * 48; 
  25:   8b 55 0c                mov    0xc(%ebp),%edx
  28:   89 d0                   mov    %edx,%eax
  2a:   01 c0                   add    %eax,%eax
  2c:   01 d0                   add    %edx,%eax
  2e:   c1 e0 04                shl    $0x4,%eax
  31:   89 45 f0                mov    %eax,-0x10(%ebp)
   int t5 = t3 + t4;
  34:   8b 45 f0                mov    -0x10(%ebp),%eax
  37:   8b 55 f4                mov    -0xc(%ebp),%edx
  3a:   01 d0                   add    %edx,%eax
  3c:   89 45 ec                mov    %eax,-0x14(%ebp)
   int rval = t2 * t5;
  3f:   8b 45 f8                mov    -0x8(%ebp),%eax
  42:   0f af 45 ec             imul   -0x14(%ebp),%eax
  46:   89 45 e8                mov    %eax,-0x18(%ebp)
   return rval;
  49:   8b 45 e8                mov    -0x18(%ebp),%eax
}
  4c:   c9                      leave  
  4d:   c3                      ret

As you can see, without optimizations the compiler produces a larger binary than the example you have. You can play around with that and add a compiler optimization flag when compiling (ie. -O1, -O2, -O3). The higher the optimization level, the more abstract the disassembly's going to seem.

For example, with just level 1 optimization (gcc -c -g -O1 -m32 arith.c1), the assembly code produced is a lot shorter:

00000000 <arith>:
int arith(int x, int y, int z)
{ 
   0:   8b 4c 24 04             mov    0x4(%esp),%ecx
   4:   8b 54 24 08             mov    0x8(%esp),%edx
   int t1 = x+y;
   8:   8d 04 11                lea    (%ecx,%edx,1),%eax
   int t2 = z+t1;
   b:   03 44 24 0c             add    0xc(%esp),%eax
   int t3 = x+4;
   int t4 = y * 48; 
   f:   8d 14 52                lea    (%edx,%edx,2),%edx
  12:   c1 e2 04                shl    $0x4,%edx
   int t5 = t3 + t4;
  15:   8d 54 11 04             lea    0x4(%ecx,%edx,1),%edx
   int rval = t2 * t5;
  19:   0f af c2                imul   %edx,%eax
   return rval;
}
  1c:   c3                      ret
like image 116
AusCBloke Avatar answered Oct 24 '22 17:10

AusCBloke


You can't reproduce the original source, you can only reproduce an equivalent source.

In your case the calculation for t2 can appear anywhere after t1 and before retval.

The source might even have been a single expression:

return (x+y+z) * ((x+4) + (y * 48));
like image 33
CodesInChaos Avatar answered Oct 24 '22 16:10

CodesInChaos


When reverse engineering, you don't care about the original source code line by line, you care about what it does. A side effect is that you see what the code does, not what the programmer intended the code to do.

like image 37
ninjalj Avatar answered Oct 24 '22 18:10

ninjalj


Decompilation is not perfectly achievable: there is some knowledge loss when going from the source code (where comments & names gave you a clue of the original programmer's intent) to binary machine code (where instructions are to be executed by the processor).

like image 42
Basile Starynkevitch Avatar answered Oct 24 '22 16:10

Basile Starynkevitch