
Assembly-level function fingerprint

I would like to determine whether two functions in two executables were compiled from the same (C) source code, even if they were compiled by different compiler versions or with different compilation options. Currently, I'm considering implementing some kind of assembly-level function fingerprinting. The fingerprint of a function should have the following properties:

  1. two functions compiled from the same source under different circumstances are likely to have the same fingerprint (or a similar one),
  2. two functions compiled from different C source are likely to have different fingerprints,
  3. (bonus) if the two source functions were similar, the fingerprints are also similar (for some reasonable definition of similar).

What I'm looking for right now is a set of properties of compiled functions that individually satisfy (1.) and taken together hopefully also (2.).

Assumptions

Of course, this is impossible in general, but there might be something that works in most cases. Here are some assumptions that could make it easier:

  • Linux ELF binaries (without debugging information available, though),
  • not obfuscated in any way,
  • compiled by GCC,
  • on x86 Linux (an approach that can be implemented on other architectures would be nice).

Ideas

Unfortunately, I have little to no experience with assembly. Here are some ideas for the abovementioned properties:

  • types of instructions contained in the function (i.e. floating point instructions, memory barriers)
  • memory accesses from the function (does it read from or write to the heap? the stack?)
  • library functions called (their names should be available in the ELF; also their order shouldn't usually change)
  • shape of the control flow graph (I guess this will be highly dependent on the compiler)
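A minimal sketch of how the first three ideas could be combined into one fingerprint. Everything here is an assumption made for illustration: the input is a list of `(mnemonic, operands)` pairs in AT&T syntax that some disassembler front end has already produced, calls to library functions appear as `name@plt`, and the `INSTRUCTION_CLASSES` table is a hypothetical, deliberately tiny stand-in for a real classification of x86 mnemonics.

```python
from dataclasses import dataclass

# Hypothetical, deliberately tiny mnemonic-to-class table.
INSTRUCTION_CLASSES = {
    "fadd": "float", "fmul": "float", "addsd": "float", "mulsd": "float",
    "mfence": "barrier", "lfence": "barrier", "sfence": "barrier",
    "call": "call", "jmp": "branch", "je": "branch", "jne": "branch",
}


@dataclass(frozen=True)
class Fingerprint:
    classes: frozenset   # kinds of instructions seen in the function
    calls: tuple         # library functions called, in order of appearance
    touches_stack: bool  # any operand addressed relative to esp/ebp


def fingerprint(instructions):
    """Build a fingerprint from (mnemonic, operands) pairs."""
    classes = set()
    calls = []
    touches_stack = False
    for mnemonic, operands in instructions:
        cls = INSTRUCTION_CLASSES.get(mnemonic)
        if cls:
            classes.add(cls)
        # Assumed convention: library calls are written as "name@plt".
        if mnemonic == "call" and operands.endswith("@plt"):
            calls.append(operands[:-len("@plt")])
        if "(%esp)" in operands or "(%ebp)" in operands:
            touches_stack = True
    return Fingerprint(frozenset(classes), tuple(calls), touches_stack)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(f1, f2):
    """Unweighted average of the per-property similarities, in [0, 1]."""
    scores = [
        jaccard(f1.classes, f2.classes),
        jaccard(set(f1.calls), set(f2.calls)),
        1.0 if f1.touches_stack == f2.touches_stack else 0.0,
    ]
    return sum(scores) / len(scores)
```

Two fingerprints compare equal when all three properties match, which corresponds to requirement (1.), while `similarity` gives a graded score in the spirit of requirement (3.). The equal weighting of the three properties is arbitrary and would need tuning on real binaries.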

Existing work

I was able to find only tangentially related work:

  • Automated approach which can identify crypto algorithms in compiled code: http://www.emma.rub.de/research/publications/automated-identification-cryptographic-primitives/
  • Fast Library Identification and Recognition Technology in IDA disassembler; identifies concrete instruction sequences, but still contains some possibly useful ideas: http://www.hex-rays.com/idapro/flirt.htm

Do you have any suggestions regarding the function properties? Or a different idea which also accomplishes my goal? Or was something similar already implemented and I completely missed it?

b42 asked Sep 02 '11 12:09


1 Answer

FLIRT uses byte-level pattern matching, so it breaks down with any changes in the instruction encodings (e.g. different register allocation/reordered instructions).
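To illustrate why exact matching is fragile, here is a small sketch. It is not how FLIRT is implemented: `exact_hash` merely stands in for any matcher that depends on the exact instruction sequence, and `normalized_hash` is a hypothetical alternative that ignores operands and instruction order. The `(mnemonic, operands)` input format is likewise an assumption.

```python
import hashlib
from collections import Counter


def exact_hash(instructions):
    """Hash the full instruction text; sensitive to registers and ordering."""
    text = "\n".join(f"{m} {ops}" for m, ops in instructions)
    return hashlib.sha256(text.encode()).hexdigest()


def normalized_hash(instructions):
    """Hash only the multiset of mnemonics; ignores operands and ordering."""
    counts = Counter(m for m, _ in instructions)
    text = ";".join(f"{m}:{n}" for m, n in sorted(counts.items()))
    return hashlib.sha256(text.encode()).hexdigest()


# The same computation (sum of two stack arguments) with a different register
# choice and the two loads swapped.
v1 = [("mov", "0x8(%ebp),%eax"), ("mov", "0xc(%ebp),%edx"),
      ("add", "%edx,%eax"), ("ret", "")]
v2 = [("mov", "0xc(%ebp),%ecx"), ("mov", "0x8(%ebp),%eax"),
      ("add", "%ecx,%eax"), ("ret", "")]

print(exact_hash(v1) == exact_hash(v2))
print(normalized_hash(v1) == normalized_hash(v2))
```

The trade-off is that the normalized form throws away a lot of information, so unrelated functions with the same instruction mix will collide. It would have to be combined with other properties to be useful.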

For graph matching, see BinDiff. While it's closed source, Halvar has described some of the approaches on his blog. They have even open-sourced some of the algorithms they use to generate fingerprints, in the form of the BinCrowd plugin.
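As a rough idea of what a graph-based fingerprint can look like, the sketch below reduces a control flow graph to a label-independent signature: block count, edge count, call count, and the sorted list of (in-degree, out-degree) pairs. Whether BinDiff or BinCrowd compute exactly this is an assumption on my part; the function name and the dict-of-successors representation are hypothetical.

```python
def cfg_signature(cfg, n_calls=0):
    """Summarise a CFG given as {block_label: [successor_labels]}.

    The result does not depend on block labels or addresses, only on shape.
    """
    nodes = set(cfg)
    for succs in cfg.values():
        nodes.update(succs)

    edges = sum(len(succs) for succs in cfg.values())

    in_deg = {n: 0 for n in nodes}
    for succs in cfg.values():
        for s in succs:
            in_deg[s] += 1

    # Sorted (in-degree, out-degree) pairs, one per basic block.
    degrees = tuple(sorted((in_deg[n], len(cfg.get(n, ()))) for n in nodes))
    return (len(nodes), edges, n_calls, degrees)


# An if/else diamond, labelled two different ways, and a simple loop.
diamond_a = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
diamond_b = {0: [1, 2], 1: [3], 2: [3], 3: []}
loop = {"head": ["body"], "body": ["body", "tail"], "tail": []}

print(cfg_signature(diamond_a) == cfg_signature(diamond_b))
print(cfg_signature(diamond_a) == cfg_signature(loop))
```

A signature like this is only a coarse filter: small functions will collide often, and optimisations such as inlining or loop unrolling change the graph, so it works best as one property among several.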

Igor Skochinsky answered Nov 02 '22 00:11