
How to divide disassembled C code to functions?

I have an application that creates .text segment dumps of win32 processes and then divides the code into basic blocks. A basic block is a sequence of instructions that are always executed one after another (a jump is always the last instruction of such a block). Here is an example:

Basic block 1
    mov ecx, dword ptr [ecx]
    test ecx, ecx
    je 00401013h

Basic block 2
    mov eax, dword ptr [ecx]
    call dword ptr [eax+08h]

Basic block 3
    test eax, eax
    je 0040100Ah

Basic block 4
    mov edx, dword ptr [eax]
    push 00000001h
    mov ecx, eax
    call dword ptr [edx]

Basic block 5
    ret 000008h

Now I would like to group such basic blocks into functions, i.e. say which basic blocks form a function. What's the algorithm? I have to keep in mind that there may be many ret instructions inside one function. How can I detect fastcall functions?

Adam Sznajder asked Feb 07 '13 16:02





2 Answers

The simplest algorithm for grouping blocks into functions would be:

  1. Note every address that is the target of a direct call some_address instruction; each one is a function entry point.
  2. If the block starting at such an address ends with ret, you're done with that function; otherwise
  3. follow the jump at the end of the block to the next block, and so on, until you've followed all possible execution paths (remember that each conditional jump splits a path into two) and every path has ended with ret. Keep track of the blocks you have already visited, so that jumps which form loops don't send your own program into an infinite loop.
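The steps above can be sketched as a worklist traversal. This is only a minimal illustration: the block table, its addresses and the function names are hypothetical, and it assumes you have already computed, for every basic block, the start addresses of its successors (jump targets plus fall-through).

```python
from collections import deque

# Hypothetical block table: start address -> successor start addresses
# (jump targets plus fall-through). Blocks ending in ret have no successors.
blocks = {
    0x1000: [0x1010, 0x1020],  # ends with a conditional jump: two successors
    0x1010: [0x1020],          # falls through into the exit block
    0x1020: [],                # ends with ret
    0x2000: [0x2000, 0x2008],  # loops back to itself or falls through
    0x2008: [],                # ends with ret
}
call_targets = {0x1000, 0x2000}  # operands of direct call instructions


def collect_function(entry, blocks):
    """Return the set of block addresses reachable from one entry point."""
    seen = set()
    work = deque([entry])
    while work:
        addr = work.popleft()
        if addr in seen or addr not in blocks:
            continue  # already visited (loop) or target outside the dump
        seen.add(addr)
        work.extend(blocks[addr])
    return seen


def group_into_functions(blocks, call_targets):
    """Map every known call target to the sorted list of its blocks."""
    return {
        entry: sorted(collect_function(entry, blocks))
        for entry in sorted(call_targets)
        if entry in blocks
    }


functions = group_into_functions(blocks, call_targets)
```

The seen set is what keeps the traversal from hanging on loops. A block reachable from two entry points simply shows up in both functions, which is one way to spot shared code.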

Problems:

  1. Many calls are made indirectly through function pointers read from memory, e.g. you'd have call [some_address] instead of call some_address.
  2. Some indirect calls go to addresses that are calculated at run time.
  3. A function that calls another function as its last action may end with jmp some_address (a tail call) instead of call some_address immediately followed by ret.
  4. Control transfers can be disguised: push some_address + ret behaves like jmp some_address, and push return_address + jmp some_address behaves like call some_address.
  5. Some functions may share code at their ends (i.e. they have different entry points, but one or more exit points are the same).

You may use some heuristic to determine where functions start by looking for the most common prolog instruction sequence:

push ebp
mov ebp, esp

Again, this may not work if functions are compiled with the frame pointer omitted, in which case they address their parameters on the stack relative to esp instead of ebp.
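As a rough sketch of this heuristic, the scan below looks for the two usual encodings of push ebp / mov ebp, esp in a raw .text dump (55 8B EC and 55 89 E5; which one appears depends on the assembler or compiler). The function name, sample bytes and base address are made up for illustration, and every hit is only a candidate, since the same bytes can occur inside another instruction or in data.

```python
# Two common encodings of "push ebp; mov ebp, esp" on 32-bit x86.
PROLOGUES = (b"\x55\x8b\xec", b"\x55\x89\xe5")


def find_prologues(code, base):
    """Return sorted candidate function addresses where a prologue matches."""
    hits = []
    for pattern in PROLOGUES:
        pos = code.find(pattern)
        while pos != -1:
            hits.append(base + pos)
            pos = code.find(pattern, pos + 1)
    return sorted(hits)


# Hypothetical dump: padding, a function with "sub esp, 10h", then another.
sample = b"\xcc\xcc\x55\x8b\xec\x83\xec\x10\xc3\x55\x89\xe5\xc3"
candidates = find_prologues(sample, 0x401000)
```

Combining these candidates with the call targets found by the traversal gives more confidence than either source alone.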

The compiler (e.g. MSVC++) may also pad the space between functions with int 3 instructions, and that too can serve as a hint that a function begins right after the padding.
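A minimal sketch of the padding hint, assuming the one-byte int 3 encoding (0xCC) and a made-up minimum run length of two bytes to cut down on false positives. The function name and sample bytes are hypothetical, and 0xCC can of course also appear as an operand byte or in data.

```python
import re


def starts_after_padding(code, base, min_run=2):
    """Return addresses that directly follow a run of int 3 (0xCC) bytes."""
    pattern = re.compile(b"\xcc{%d,}" % min_run)
    starts = []
    for match in pattern.finditer(code):
        if match.end() < len(code):  # ignore padding at the very end
            starts.append(base + match.end())
    return starts


# Hypothetical dump: ret, 3 bytes of padding, a function, a lone 0xCC, code.
sample = b"\xc3\xcc\xcc\xcc\x55\x8b\xec\xc3\xcc\x33\xc0\xc3"
after_padding = starts_after_padding(sample, 0x401000)
```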

As for differentiating between the various calling conventions, it's probably easiest to look at the symbols (if you have them, of course). MSVC++ decorates C function names with different prefixes and suffixes, where number is the size of the arguments in bytes, e.g.:

  • _function - cdecl
  • _function@number - stdcall
  • @function@number - fastcall
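If symbols are available, the three decorations listed above can be told apart with a few regular expressions. This is a sketch only: the function name is hypothetical, and it covers just these C-style decorations, not C++ mangled names.

```python
import re

_FASTCALL = re.compile(r"^@[^@]+@\d+$")   # @function@number
_STDCALL = re.compile(r"^_[^@]+@\d+$")    # _function@number
_CDECL = re.compile(r"^_[^@]+$")          # _function


def convention_from_symbol(name):
    """Guess the calling convention from a decorated C symbol name."""
    if _FASTCALL.match(name):
        return "fastcall"
    if _STDCALL.match(name):
        return "stdcall"
    if _CDECL.match(name):
        return "cdecl"
    return None  # e.g. a C++ mangled name; not handled here


guesses = [convention_from_symbol(n)
           for n in ("_main", "_MessageBoxA@16", "@fast@8", "?foo@@YAHXZ")]
```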

If you cannot extract this information from the symbols, you must analyze the code to see how parameters are passed to functions and whether the functions or their callers remove them from the stack.
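One piece of that analysis is easy to sketch: the form of the ret instruction shows who removes the stack arguments. A plain ret (C3) leaves cleanup to the caller, while ret imm16 (C2 followed by a 16-bit little-endian count) makes the callee pop that many bytes, as stdcall and fastcall functions do. The function name below is hypothetical, and a plain ret is not proof of cdecl, since a callee-cleanup function with no stack arguments also ends with it.

```python
def callee_cleanup_bytes(ret_bytes):
    """Return how many argument bytes a near ret instruction pops."""
    opcode = ret_bytes[:1]
    if opcode == b"\xc3":                      # ret
        return 0
    if opcode == b"\xc2" and len(ret_bytes) >= 3:  # ret imm16
        return int.from_bytes(ret_bytes[1:3], "little")
    raise ValueError("not a near ret instruction")


# "ret 000008h" from the question encodes as C2 08 00.
popped = callee_cleanup_bytes(b"\xc2\x08\x00")
```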

Alexey Frunze answered Oct 23 '22 14:10



You could treat an enter instruction, or the equivalent code that sets up a stack frame, as the beginning of a function:

push ebp
mov  ebp, esp
sub  esp, (bytes for "local" stack space)

Later you'll find the opposite code (or a leave instruction) just before a ret:

mov esp, ebp
pop ebp

You can also use the number of bytes reserved for local stack space to help identify the function's local variables.

Identifying thiscall, fastcall, etc. will take some analysis of the code just before the calls that target the function's entry address, plus an evaluation of which registers are used and what gets cleaned up.
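One way to approach the register side of this is to check which of ecx and edx a function reads before it writes them. The sketch below assumes a hypothetical pre-decoded form in which each instruction is a (reads, writes) pair of register sets; producing that form from raw bytes is the job of your disassembler and is not shown.

```python
ARG_REGISTERS = {"ecx", "edx"}  # registers used by thiscall / fastcall


def register_args(instructions):
    """Return ecx/edx if read before being written in a straight-line block."""
    written = set()
    live_in = set()
    for reads, writes in instructions:
        live_in |= set(reads) - written  # read while still holding entry value
        written |= set(writes)
    return sorted(live_in & ARG_REGISTERS)


# Basic block 1 from the question, decoded by hand:
entry_block = [
    ({"ecx"}, {"ecx"}),  # mov ecx, dword ptr [ecx]
    ({"ecx"}, set()),    # test ecx, ecx
    (set(), set()),      # je 00401013h
]
incoming = register_args(entry_block)
```

Only ecx being live on entry hints at thiscall or a one-argument fastcall; both ecx and edx hint at fastcall. Idioms such as xor ecx, ecx formally read the register and would need special-casing.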

user7116 answered Oct 23 '22 13:10
