Symbolicating stripped binary using symbols from older debug version (inexact graph matching)

Question

I have binary A, which is a debug build with accompanying symbols -- built many years ago. I also have binary B, a release build without accompanying symbols and is much more recent. I am looking for the most efficient method of matching the symbols from binary A to potential candidates in binary B.

Given that the debug build is considerably bigger (does more input validation, prints more things to stderr, etc.) and that functions invariably change over time, I figure that trying to fingerprint the individual functions will be wasted time.

Therefore, I've decided -- quite out of thin air, so I could be barking up the wrong tree -- that the best way to fingerprint the functions is to create call graphs of both binaries and try to match the vertices (i.e. the functions).

I have already done some preprocessing, so I have the following data structures:

# binary A
[[60, 60, 8734], # function 0 is called by functions 60 (twice) and 8734
 [193, 441, 505], # function 1 is called by functions 193, 441 and 505
 [193, 742],
 [23],  
 [21],  
 [21],  
 [26],  
 [26, 1508, 1509, 1573],  
 [24],  
 [25],
 ...] # (~10k functions)

# binary B
[[8999], # function 0 is called by function 8999
 [9016], # function 1 is called by function 9016
 [1126], 
 [7904, 7904, 7913], 
 [182, 336, 396, 396], 
 [9010], 
 [407], 
 [182, 632], 
 [20], 
 [24],
 ...] # (~10k functions)

An important note to take away is that there is no correspondence between function "0" in binary A and function "0" in binary B. These are arbitrary IDs I have assigned to each function in each binary.

The next step is that which confounds me. My algorithm-fu is very weak and I cannot think of a clever way to proceed. My (very limited) understanding is that to solve this, I would want to employ some form of inexact graph matching. In other words, which set of mappings Ai -> Bi would maximise the similarity of the two call-graphs?

Given that there are additional debugging functions in binary A and the obvious fact that programs evolve over time, there is likely no exact match. Ideally, I would want output of the form:

[[(37, 0.998), (8432, 0.912), (442, 0.75)], # matching-ness of function "0" in binary A with function "37" in binary B is 0.998, second most likely candidate is function "8432" in binary B with score 0.912, etc.
 [(42, 0.973), (7751, 0.788)], # matching-ness of function "1" in binary A with function "42" in binary B is 0.973, second most likely candidate is function "7751" in binary B with score 0.788, etc.
 [(4579, 0.996), (123, 0.934)], 
 ...] # around ~10k mappings

In reality, I would be happy with if I had to make do with only one candidate and no ranking was provided, but one can dream.

Any SO-goers have an idea of where I should start?

j_random_hacker · Accepted Answer

Certainly an interesting problem, though I suspect it will be hard to solve. It appears to be an instance of approximate graph isomorphism on directed graphs. I didn't find much googling for this, but here's some software for solving for undirected general graphs, a more general case which is NP hard.

Aligning Executable Code

I think the most practical thing you may be able to do is to forget about the runtime information and simply take the executable code sections of each version and use a global alignment algorithm (e.g. Needleman-Wunsch, although there do exist much faster but less accurate algorithms) on them that:

Treats entire instructions as characters (this will require building a rudimentary disassembler)
Disregards address components of instructions entirely
Downweights deletions (assuming the "first" file is the debug version, which we expect to be larger)
Upweights matches of CALL instructions, and possibly other "reliable" sequences of instructions.

Assuming the order that the functions appear in the executables hasn't changed too much (which it won't have, unless the optimised version has used some optimisation that gets the linker to place functions that call each other near each other), this should get you a nice first approximation.

The Assignment Problem (Bipartite Maximum Matching)

Alternatively, if you can find a way (and my intuition suggests it would need to be an iterative approach, along the lines of how PageRank decides the value of a webpage) to "score" the likelihood that a function f in the debug version corresponds to a function g in the optimised version, then yes you could use a graph matching technique. In this case the vertices of the graph would be all functions in both versions, and there would be a weighted edge between every function from the debug version and every function from the optimised version, with the weight determined by your scoring system.

This graph would be bipartite because there will never be an edge between 2 functions in the same version. That means it is an instance of the Assignment Problem, for which quite good (and not too complicated) algorithms exist.

However, the missing piece here is the means of deciding the weights for each pairing. One approximate way of doing this would be to build a vector counting the number of immediate children, grandchildren, great-grandchildren and so on for each function. You could then compare these vectors using any distance measure you like. But I expect doing this here will not work well, because we already expect the debug version to contain far more calls to functions than the optimised version.

Using the Full Call Tree

If you have access to the entire call tree for both, this will give you more information: the sequence of calls made within a function, as well as knowledge of the exact hierarchy of calls. You might then be able to build a "signature" for each function by doing the using just the latter:

Extract the list of distinct functions called by a given function
Label the 1st-called function 1, the 2nd-called distinct function 2, etc.
The signature is just this sequence of function labels in the order that they are called in the function.

Now, Levenshtein distance can be used to compare 2 signatures. For more accuracy at the expense of more computation, you could use a variation in which up to k distinct functions in the debug version are allowed to be deleted, for some small k (e.g. k = 3), and the best Levenshtein distance is taken over all such "slimmed down" versions, with a small penalty attached that is proportional to the number of functions deleted.

Igor Skochinsky · Answer

If you can afford it, IDA Pro + BinDiff.

Symbolicating stripped binary using symbols from older debug version (inexact graph matching)

Tags:

algorithm

graph

reverse-engineering

call-graph

Sedate Alien

2 Answers

Aligning Executable Code

The Assignment Problem (Bipartite Maximum Matching)

Using the Full Call Tree

j_random_hacker

Igor Skochinsky

Recent Activity

Donate For Us

Symbolicating stripped binary using symbols from older debug version (inexact graph matching)

Tags:

algorithm

graph

reverse-engineering

call-graph

Sedate Alien

2 Answers

Aligning Executable Code

The Assignment Problem (Bipartite Maximum Matching)

Using the Full Call Tree

j_random_hacker

Igor Skochinsky

Related questions

Recent Activity

Donate For Us