I would like to know if anyone can help me out with a problem I am having when studying one of the lecture slides from an introductory assembly class that I am taking in school. The problem I am having is not understanding the assembly, it is how exactly the C source code is ordered based on the assembly. I will post the snippet I am talking about and maybe it will be clearer what I am talking about.
C Source given:
int arith(int x, int y, int z)
{
int t1 = x+y;
int t2 = z+t1;
int t3 = x+4;
int t4 = y * 48;
int t5 = t3 + t4;
int rval = t2 * t5;
return rval;
}
Assembly given:
arith:
pushl %ebp
movl %esp,%ebp
movl 8(%ebp),%eax
movl 12(%ebp),%edx
leal (%edx,%eax),%ecx
leal (%edx,%edx,2),%edx
sall $4,%edx
addl 16(%ebp),%ecx
leal 4(%edx,%eax),%eax
imull %ecx,%eax
movl %ebp,%esp
popl %ebp
ret
I am just confused as to how I am supposed to be able to discern for example that the adding of z + t1
(z + x + y
) is listed on the second line(in the source) when in the assembly it comes after the y * 48
in the assembly code or for example that x + 4
is the 3rd line when in the assembly it is not even in a line by itself, its sort of mixed in with the last leal
statement. It makes sense to me when I have the source but I am supposed to be able to reproduce the source for a test and I do understand that the compiler optimizes things but if anyone has a way of thinking about the reverse engineering that could help me out I would greatly appreciate it if they could walk me through their thought process.
Thanks.
The Reverser allows you to reverse engineer compilable C code to a model, which you may want to do for the following reasons: To view the structure of the C code in Modeler. To develop the C code further in Modeler before regenerating the code. To move the C code to another platform, such as C++ or Java.
To do this, go to the Modules window and from the context menu of a . NET assembly, and then select the Decompile source code command. Visual Studio generates a symbol file for the assembly and then embeds the source into the symbol file. In a later step, you can extract the embedded source code.
Assembly programming for the reverse engineer is about learning how to write assembly. On top of this, it's also learning how the computer works in order to understand generated blocks of code and how the operating system deals with the user and the machine.
Done correctly, there is nothing wrong with reverse engineering, and it is not considered an “improper means” of gathering information, as defined by the Defend Trade Secrets Act (DTSA).
I've broken down the disassembly for you to show how the assembly was produced from the C source.
8(%ebp)
= x
, 12(%ebp)
= y
, 16(%ebp)
= z
arith:
Create the stack frame:
pushl %ebp
movl %esp,%ebp
x
into eax
, y
into edx
:
movl 8(%ebp),%eax
movl 12(%ebp),%edx
t1 = x + y
. leal
(Load effective address) will add edx
and eax
, and t1
will be in ecx
:
leal (%edx,%eax),%ecx
int t4 = y * 48;
in two steps below, multiply by 3, then by 16. t4
will eventually be in edx
:
Multiply edx
by 2, and add edx
to the result, ie. edx = edx * 3
:
leal (%edx,%edx,2),%edx
Shift left 4 bits, ie. multiply by 16:
sall $4,%edx
int t2 = z+t1;
. ecx
initially holds t1
, z
is at 16(%ebp)
, at the end of the instruction ecx
will be holding t2
:
addl 16(%ebp),%ecx
int t5 = t3 + t4;
. t3
was simply x + 4
, and rather than calculating and storing t3
, the expression of t3
is placed inline. This instruction essential does (x+4) + t4
, which is the same as t3
+ t4
. It adds edx
(t4
) and eax
(x
), and adds 4 as an offset to achieve that result.
leal 4(%edx,%eax),%eax
int rval = t2 * t5;
Fairly straight-forward this one; ecx
represents t2
and eax
represents t5
. The return value is passed back to the caller through eax
.
imull %ecx,%eax
esp
and ebp
:
movl %ebp,%esp
popl %ebp
ret
As others have said, you can't go exactly back to the source from the disassembly. It's up to the interpretation of the person reading the assembly to come up with equivalent C code.
Compile with debug information (-g
), which will embed the source:
gcc -c -g arith.c
If you're on a 64-bit machine, you can tell the compiler to create a 32-bit binary with the -m32
flag (I did so for the example below).
Use objdump to dump the object file with it's source interleaved:
objdump -d -S arith.o
-d
= disassembly, -S
= display source. You can add -M intel-mnemonic
to use the Intel ASM syntax if you prefer that over the AT&T syntax that your example uses.
Output:
arith.o: file format elf32-i386
Disassembly of section .text:
00000000 <arith>:
int arith(int x, int y, int z)
{
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 20 sub $0x20,%esp
int t1 = x+y;
6: 8b 45 0c mov 0xc(%ebp),%eax
9: 8b 55 08 mov 0x8(%ebp),%edx
c: 01 d0 add %edx,%eax
e: 89 45 fc mov %eax,-0x4(%ebp)
int t2 = z+t1;
11: 8b 45 fc mov -0x4(%ebp),%eax
14: 8b 55 10 mov 0x10(%ebp),%edx
17: 01 d0 add %edx,%eax
19: 89 45 f8 mov %eax,-0x8(%ebp)
int t3 = x+4;
1c: 8b 45 08 mov 0x8(%ebp),%eax
1f: 83 c0 04 add $0x4,%eax
22: 89 45 f4 mov %eax,-0xc(%ebp)
int t4 = y * 48;
25: 8b 55 0c mov 0xc(%ebp),%edx
28: 89 d0 mov %edx,%eax
2a: 01 c0 add %eax,%eax
2c: 01 d0 add %edx,%eax
2e: c1 e0 04 shl $0x4,%eax
31: 89 45 f0 mov %eax,-0x10(%ebp)
int t5 = t3 + t4;
34: 8b 45 f0 mov -0x10(%ebp),%eax
37: 8b 55 f4 mov -0xc(%ebp),%edx
3a: 01 d0 add %edx,%eax
3c: 89 45 ec mov %eax,-0x14(%ebp)
int rval = t2 * t5;
3f: 8b 45 f8 mov -0x8(%ebp),%eax
42: 0f af 45 ec imul -0x14(%ebp),%eax
46: 89 45 e8 mov %eax,-0x18(%ebp)
return rval;
49: 8b 45 e8 mov -0x18(%ebp),%eax
}
4c: c9 leave
4d: c3 ret
As you can see, without optimizations the compiler produces a larger binary than the example you have. You can play around with that and add a compiler optimization flag when compiling (ie. -O1
, -O2
, -O3
). The higher the optimization level, the more abstract the disassembly's going to seem.
For example, with just level 1 optimization (gcc -c -g -O1 -m32 arith.c1
), the assembly code produced is a lot shorter:
00000000 <arith>:
int arith(int x, int y, int z)
{
0: 8b 4c 24 04 mov 0x4(%esp),%ecx
4: 8b 54 24 08 mov 0x8(%esp),%edx
int t1 = x+y;
8: 8d 04 11 lea (%ecx,%edx,1),%eax
int t2 = z+t1;
b: 03 44 24 0c add 0xc(%esp),%eax
int t3 = x+4;
int t4 = y * 48;
f: 8d 14 52 lea (%edx,%edx,2),%edx
12: c1 e2 04 shl $0x4,%edx
int t5 = t3 + t4;
15: 8d 54 11 04 lea 0x4(%ecx,%edx,1),%edx
int rval = t2 * t5;
19: 0f af c2 imul %edx,%eax
return rval;
}
1c: c3 ret
You can't reproduce the original source, you can only reproduce an equivalent source.
In your case the calculation for t2
can appear anywhere after t1
and before retval
.
The source might even have been a single expression:
return (x+y+z) * ((x+4) + (y * 48));
When reverse engineering, you don't care about the original source code line by line, you care about what it does. A side effect is that you see what the code does, not what the programmer intended the code to do.
Decompilation is not perfectly achievable: there is some knowledge loss when going from the source code (where comments & names gave you a clue of the original programmer's intent) to binary machine code (where instructions are to be executed by the processor).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With