I have an application that creates .text segment dumps of Win32 processes. It then divides the code into basic blocks. A basic block is a sequence of instructions that are always executed one after another (a jump is always the last instruction of a basic block). Here is an example:
Basic block 1
mov ecx, dword ptr [ecx]
test ecx, ecx
je 00401013h
Basic block 2
mov eax, dword ptr [ecx]
call dword ptr [eax+08h]
Basic block 3
test eax, eax
je 0040100Ah
Basic block 4
mov edx, dword ptr [eax]
push 00000001h
mov ecx, eax
call dword ptr [edx]
Basic block 5
ret 000008h
Now I would like to group these basic blocks into functions, that is, to say which basic blocks form a function. What's the algorithm? I have to keep in mind that there may be many ret instructions inside one function. How can I detect fastcall functions?
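The simplest algorithm for grouping blocks into functions would be:
1. Collect the target addresses of all call some_address instructions; each target is a function entry point.
2. If the block at that address ends with ret, you're done with the function, else
3. follow the jumps from that block to the next blocks, and so on along every possible path, until each path ends with ret. You'll need to recognize jumps that organize loops so your program itself does not hang by entering an infinite loop.
Problems:
1. Calls may be indirect, i.e. call [some_address] instead of call some_address.
2. A function that calls another one right before returning may contain jmp some_address instead of call some_address immediately followed by ret.
3. call some_address can be simulated with a combination of push some_address + ret OR push some_address + jmp some_other_address.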
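The traversal described above can be sketched roughly as follows. This is only an illustration, assuming the basic blocks are already available as a dictionary keyed by start address; the Block record, its field names, and the extra_entries parameter are hypothetical and not part of the question's tool.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical basic-block record; all field names are assumptions."""
    start: int
    successors: list = field(default_factory=list)    # jump target and/or fall-through
    call_targets: list = field(default_factory=list)  # targets of direct calls only
    ends_with_ret: bool = False


def group_into_functions(blocks, extra_entries=()):
    """Map each function entry address to the set of block addresses reachable from it."""
    # Entry points: every direct call target, plus any entries known by other means
    # (for example the program entry point or exported symbols).
    entries = set(extra_entries)
    for block in blocks.values():
        entries.update(t for t in block.call_targets if t in blocks)

    functions = {}
    for entry in sorted(entries):
        seen = set()
        stack = [entry]
        while stack:
            addr = stack.pop()
            if addr in seen or addr not in blocks:
                continue  # the seen set is what keeps loops from hanging the walk
            seen.add(addr)
            if not blocks[addr].ends_with_ret:
                stack.extend(blocks[addr].successors)
        functions[entry] = seen
    return functions
```

A block reachable from two entry points ends up in both sets, and a tail jmp into another function pulls that function's blocks in as well, so the result is a first approximation that the problems listed above can still defeat.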
You may use some heuristic to determine where functions start by looking for the most common prolog instruction sequence:
push ebp
mov ebp, esp
Again, this may not work if functions are compiled with the frame pointer suppressed (i.e. they'd use esp instead of ebp to access their parameters on the stack, which is possible).
The compiler (e.g. MSVC++) may also pad the inter-function space with int 3 instructions, and that too can serve as a hint that a function beginning is coming up.
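Both hints can be searched for directly in the raw dump. The sketch below assumes the .text segment is available as a bytes object; the function name and the base parameter are made up for illustration. It looks for the two standard encodings of push ebp / mov ebp, esp (55 8B EC and 55 89 E5) and for the first byte following a run of int 3 (CC) padding.

```python
PROLOGUES = (b"\x55\x8b\xec", b"\x55\x89\xe5")  # push ebp; mov ebp, esp (two encodings)
INT3 = 0xCC


def candidate_function_starts(code, base=0):
    """Return sorted addresses that look like function starts in a raw code dump."""
    hits = set()
    for i in range(len(code)):
        if code.startswith(PROLOGUES, i):
            hits.add(base + i)
        elif i > 0 and code[i - 1] == INT3 and code[i] != INT3:
            # first non-padding byte after one or more int 3 bytes
            hits.add(base + i)
    return sorted(hits)
```

A plain byte scan can also match in the middle of an instruction or inside data, so the hits are only candidates. Restricting the check to addresses where a basic block starts should cut down on false positives.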
As for differentiating between the various calling conventions, it's perhaps easiest to look at the symbols (if you have them, of course). MSVC++ generates different name prefixes and suffixes for 32-bit x86 C functions, e.g.:
_function - cdecl
_function@4 - stdcall
@function@4 - fastcall
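If symbol names are available, the decoration can be checked mechanically. The sketch below assumes 32-bit x86 MSVC decoration of C functions (leading underscore for cdecl, leading underscore plus @N suffix for stdcall, leading @ plus @N suffix for fastcall); the function name is hypothetical, and C++ mangled names, which start with ?, are deliberately not handled.

```python
import re


def convention_from_decorated_name(name):
    """Guess the calling convention from an MSVC-decorated 32-bit C symbol name."""
    if re.fullmatch(r"@\w+@\d+", name):
        return "fastcall"   # @name@bytes
    if re.fullmatch(r"_\w+@\d+", name):
        return "stdcall"    # _name@bytes
    if name.startswith("_"):
        return "cdecl"      # _name
    return "unknown"        # includes C++ names starting with '?'
```

The number after the last @ is the size of the arguments in bytes, which can be cross-checked against the operand of the function's ret instruction.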
If you cannot extract this information from the symbols, you must analyze code to see how parameters are passed to functions and whether functions or their callers remove them from the stack.
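One part of that analysis, who removes the parameters from the stack, can be read off the ret instructions. The sketch below assumes the function's instructions are available as text in the same style as the listing in the question (hexadecimal operands with an h suffix); the function name is made up.

```python
def callee_cleanup_bytes(instructions):
    """Return the set of byte counts popped by the function's own ret instructions."""
    sizes = set()
    for ins in instructions:
        parts = ins.lower().split()
        if parts and parts[0] in ("ret", "retn"):
            # "ret 000008h" pops 8 bytes of arguments; a bare "ret" pops none
            sizes.add(int(parts[1].rstrip("h"), 16) if len(parts) > 1 else 0)
    return sizes
```

A non-zero value such as the ret 000008h in the example means the callee cleans up, which fits stdcall, fastcall or thiscall. A result of {0} fits cdecl, where the caller adjusts esp after the call, but also any function that takes no stack parameters.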
You could use the presence of enter to denote the beginning of a function, or certain code which sets up a frame:
push ebp
mov ebp, esp
sub esp, (bytes for "local" stack space)
Later you'll find the opposite code (or leave) just before the ret:
mov esp, ebp
pop ebp
You can also use the number of bytes for local stack space to identify local variables.
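As a rough sketch of that check, the function below tests whether a block opens with the standard frame setup and, if so, returns the number of bytes reserved for locals. It assumes instructions are given as text in the style of the listing in the question (hexadecimal immediates with an h suffix); the function names are hypothetical, and enter is not handled.

```python
import re


def _norm(ins):
    # lower-case and normalise spacing so "MOV  ebp,esp" compares equal to "mov ebp, esp"
    return re.sub(r"\s*,\s*", ", ", " ".join(ins.lower().split()))


def frame_local_bytes(instructions):
    """Return the local stack size if the block opens with a standard frame, else None."""
    ins = [_norm(i) for i in instructions]
    if ins[:2] != ["push ebp", "mov ebp, esp"]:
        return None
    if len(ins) > 2:
        m = re.fullmatch(r"sub esp, ([0-9a-f]+)h", ins[2])
        if m:
            return int(m.group(1), 16)
    return 0  # frame is set up but no space is reserved for locals
```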
Identifying thiscall, fastcall, etc. will take some analysis of the code just prior to the calls that target the function's entry address, along with an evaluation of the registers used/cleaned up.
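The register side of that evaluation can be approximated by checking which of ecx and edx a function reads before it writes them. The sketch below assumes each instruction has already been decoded into a pair of register sets (registers read, registers written); that representation and both function names are assumptions, and only a straight-line instruction sequence such as the function's first block is considered.

```python
def live_in_registers(instructions, candidates=("ecx", "edx")):
    """Return the candidate registers that are read before being written."""
    written, live_in = set(), set()
    for reads, writes in instructions:
        for reg in reads:
            if reg in candidates and reg not in written:
                live_in.add(reg)
        written.update(writes)
    return live_in


def guess_register_convention(instructions):
    """Very rough guess based on incoming register use only."""
    live = live_in_registers(instructions)
    if live == {"ecx", "edx"}:
        return "fastcall"              # first two arguments arrive in ecx and edx
    if live == {"ecx"}:
        return "thiscall or fastcall"  # this pointer, or a single register argument
    if not live:
        return "stack-only"            # cdecl or stdcall
    return "unclear"
```

For the first basic block in the question, mov ecx, dword ptr [ecx] reads ecx before anything writes it, so the guess would be "thiscall or fastcall". Combining this with the caller's code just before the call, and with the operand of ret, narrows it further.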