Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.

As a start, I examined the output of

git log --patch -M --find-renames --find-copies-harder --function-context -- *.c

I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather being able to cope with a source fragment.

The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:

PATTERNS("cpp",
         /* Jump targets or access declarations */
         "!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
         /* C/++ functions/methods at top level */
         "^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
         /* compound type at top level */
         "^((struct|class|enum)[^;]*)$",
         /* -- */
         "[a-zA-Z_][a-zA-Z0-9_]*"
         "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
         "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),

This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.

This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?

like image 973
Greg Bacon Avatar asked Mar 16 '12 20:03

Greg Bacon


1 Answers

I realise this suggestion is a bit tangential, but it may help in order to clarify and rank requirements. This would work for C or C++ ...

Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.

This might not be feasible for you, but IIRC there is an option on gcc to compile so that each function is compiled to an 'independent chunk' within the generated object code file. The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning, if you are interested in the idea. )

So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'

I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positive (changes).

I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.

So it might be interesting to see if two alternative implementations of the same function (i.e. from different a change set) are the same instructions :-)

This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.

So, the regular expressions might be kept very simple :-)

Assuming all of this is straightforward, what might this approach fail to give you?

Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.

like image 135
gbulmer Avatar answered Oct 19 '22 06:10

gbulmer