
Git diff, which developers contributed most

I want a measure of "active developers" in my Git repository.

git shortlog --summary --numbered oldrelease..newrelease gives me a list of the most active committers, like this:

100  developer 1
 90  developer 2
 80  developer 3
  1  developer 4

But sometimes I see certain developers reverting other developers' work (or improving it).

Now, I want to see which developers contributed most actively to the release. The measure should give more weight to developers whose code changes are left in, and less weight to developers whose code was changed by others before the final release.

git diff oldrelease..newrelease

can give me all changed lines in the release.

I want to 'blame' all those lines to see the last developer who touched each changed line. How can I do that?

Next, I want to aggregate the result over all changed lines, so that I end up with a summary like this:

git funky_new_command oldrelease..newrelease

developer 2    added 450, removed 200 lines
developer 3    added 500, removed 100 lines
developer 1    added 4, removed 50 lines
developer 4    added 1, removed 0 lines

I think this will give a better idea of which developers contribute to the source repository over time, rather than just who commits a lot of files.

Jesper Rønn-Jensen asked Nov 01 '22 07:11



1 Answer

This is a slightly tricky problem to solve correctly, because you would (presumably) also want to reward authors who have removed lines of code. The code I give below only detects which authors have the most code present in the current codebase that was added since some previous point in time.

git diff -z --name-only HEAD~5..HEAD |
  xargs -0 -n1 -- git blame HEAD~5..HEAD -- |
  grep -v "\^" |
  sed 's/\(([^)]*\)([^)]*)\([^)]*)\)/\1 \2/' |
  sed 's/^[0-9a-f]* (\([^)]*\) \+[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] .*).*$/\1/' |
  sort | uniq -c | sort -nr

Let's see what goes on here:

git diff -z --name-only HEAD~5..HEAD lists all files that have changed between HEAD~5 and HEAD. We separate them with NUL bytes rather than newlines (-z) to avoid word-splitting problems in xargs.

xargs -0 -n1 then consumes these file names and calls git blame HEAD~5..HEAD -- once for each file. The first -- ends xargs's own options, so that the second -- is passed through to git blame. The second one is there so that git blame does not misread a filename that begins with a dash as an option.
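The NUL-separated hand-off can be tried without a repository. The sketch below is only an illustration: the two file names are made up (one contains a space), and printf stands in for git blame so that the effect of -0 and -n1 is visible.

```shell
# Two hypothetical file names, NUL-terminated the way `git diff -z --name-only` emits them.
# printf stands in for `git blame` so the example runs outside a repository.
listed=$(printf 'plain.c\0with space.c\0' | xargs -0 -n1 -- printf '[%s]\n')
printf '%s\n' "$listed"
```

Each name arrives as a single argument, including the one with a space in it, and the command runs once per name.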

grep -v "\^" then drops the lines that were already present at the first revision given: git blame prefixes the commits at that boundary with a ^, so what remains are the lines that have changed since then. This caret marker is also why we don't use the machine-readable --porcelain output, which would have made parsing much easier (see below): it has no such prefix and marks boundary commits with a separate boundary header line instead. A smarter script could have extracted what revision we started at and ignored any author lines that follow that revision, but we like to keep it "simple". A similar approach is outlined here.
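One caveat is worth checking on your own data: the pattern is not anchored, so it also discards changed lines whose code happens to contain a caret. The sketch below uses two made-up blame lines (the names and hashes are invented) to compare the filter as written with an anchored variant, which is a suggestion of mine and not part of the pipeline above.

```shell
# Hypothetical blame output: line 1 comes from a boundary commit (leading ^),
# line 2 is a changed line whose code contains a caret.
sample='^118caa4 (Alice Example 2014-01-01 10:00:00 +0000 1) int a = 1;
2b3c4d5e (Bob Example   2014-01-09 13:09:05 +0000 2) int b = a ^ 1;'

# Filter as used in the pipeline: drops every line containing a caret anywhere.
# (`|| true` because grep exits non-zero when nothing is left.)
loose=$(printf '%s\n' "$sample" | grep -v "\^" || true)

# Anchored variant (assumption, not in the original): drops only lines that start with a caret.
anchored=$(printf '%s\n' "$sample" | grep -v '^\^' || true)

printf 'loose:    %s\n' "${loose:-<nothing>}"
printf 'anchored: %s\n' "${anchored:-<nothing>}"
```

With the unanchored pattern both lines disappear; the anchored one keeps Bob's line.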

The output at this stage looks something like this:

118caa41 (Jon Gjengset 2014-01-09 13:09:05 +0000 13) .FORCE:

We want to extract the author part of this, which is non-trivial given that the name may contain spaces. It is further complicated by the fact that some repositories have users whose names contain the symbols ( and ). So, to simplify our problem, we first get rid of these nested brackets with

sed 's/\(([^)]*\)([^)]*)\([^)]*)\)/\1 \2/'

This is not exactly pretty, and will break if some annoying person has unmatched ()s in their name, but we'll say it's okay for now.
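To see what this substitution actually does, here is a small sketch run on a made-up blame line; the "(work)" part of the name is invented for the example.

```shell
# Hypothetical blame line whose author name contains a parenthesised part.
line='118caa41 (Jon (work) Gjengset 2014-01-09 13:09:05 +0000 13) .FORCE:'

# Same expression as in the pipeline: the inner "(...)" is dropped from the name.
flattened=$(printf '%s\n' "$line" | sed 's/\(([^)]*\)([^)]*)\([^)]*)\)/\1 \2/')
printf '%s\n' "$flattened"
```

Note that the bracketed part is removed rather than kept, and that a few extra spaces are left behind in the name. That is harmless for counting as long as the same author is always written the same way.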

To extract the name itself, we resort to this monster of a regex. It could be simplified further by using extended regexes, but I decided to try to keep sed compatibility as much as I could:

sed 's/^[0-9a-f]* (\([^)]*\) \+[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] .*).*$/\1/'
       ^-- 1                 ^-- 2

We first make sure we find the first bracketed expression (the code line might also contain brackets); this is marker 1. Then we match until we hit something that looks like the date seen in the middle of the brackets in the line above (marker 2), at which point we have the author's name. Anything after that point can be removed.
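The extraction can be tried on sample lines as sketched below. Two things here are my own assumptions rather than part of the answer: \+ is a GNU extension, so the sketch writes it as the portable "  *", and it adds a trimming step, because the name group is greedy and appears to keep most of the padding when git blame aligns author names into a column. The second sample line and its author are made up.

```shell
# Same expression as above, except that `\+` is written as the portable `  *`
# (one space followed by zero or more spaces).
extract_author() {
  sed 's/^[0-9a-f]* (\([^)]*\)  *[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] .*).*$/\1/'
}

# Assumed extra step, not part of the original pipeline: strip trailing padding
# so the same author is not counted under differently padded spellings.
trim_trailing() {
  sed 's/ *$//'
}

first=$(printf '%s\n' '118caa41 (Jon Gjengset 2014-01-09 13:09:05 +0000 13) .FORCE:' |
  extract_author | trim_trailing)
padded=$(printf '%s\n' '2b3c4d5e (Ann Lee      2014-01-09 13:09:05 +0000 14) all:' |
  extract_author | trim_trailing)
printf '%s\n%s\n' "$first" "$padded"
```

If the counts for one author ever show up split across several rows, differing padding between files is the first thing I would check.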

The only thing left to do at this point is to sort and rank, which we do with sort | uniq -c | sort -nr.
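The ranking idiom works on any list of names, so it can be checked in isolation. A minimal sketch with made-up input (the awk call only normalises the column padding that uniq -c adds, which differs between systems):

```shell
# One hypothetical author name per surviving line, in arbitrary order.
ranked=$(printf '%s\n' 'developer 2' 'developer 1' 'developer 2' \
                       'developer 3' 'developer 2' 'developer 1' |
  sort | uniq -c | sort -nr |
  awk '{ $1 = $1; print }')   # collapse leading padding to single spaces
printf '%s\n' "$ranked"
```

The first sort is needed because uniq -c only merges adjacent duplicates; the second sorts the counts numerically, largest first.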

And voilà: this command gives you a ranked list of authors, ordered by the number of lines they added in the given revision range that are still present in the last revision.

A word of caution: You probably wouldn't want to use this for anything mission-critical. Solutions based on regular expressions are notoriously prone to unexpected errors. Parsing the --porcelain output of git blame may be a better long-term solution.
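For what it's worth, here is a rough sketch of what the porcelain route might look like. It assumes the --line-porcelain layout as I understand it: each entry starts with a header line (commit hash and line numbers), carries an "author NAME" line, optionally a "boundary" line for commits outside the range, and ends with the source line prefixed by a tab. The count_authors and entry helpers and the sample data are made up for the illustration; check the format against your Git version before relying on it.

```shell
# Count authors in `git blame --line-porcelain` text read from stdin.
count_authors() {
  awk '
    /^[0-9a-f]+ [0-9]+ [0-9]+/ { author = ""; boundary = 0; next }  # entry header: commit and line numbers
    /^author /   { author = substr($0, 8); next }                   # text after "author "
    /^boundary$/ { boundary = 1; next }                             # commit lies outside the range
    /^\t/        { if (!boundary && author != "") count[author]++ } # tab-prefixed source line ends the entry
    END          { for (a in count) print count[a], a }
  ' | sort -nr
}

# Hypothetical helper that fakes one porcelain entry: sha, author, extra header (may be empty), code.
entry() {
  printf '%s 1 1 1\nauthor %s\n' "$1" "$2"
  if [ -n "$3" ]; then printf '%s\n' "$3"; fi
  printf 'filename demo.c\n\t%s\n' "$4"
}

ranking=$(
  {
    entry 1111111111111111111111111111111111111111 'Alice Example' boundary 'int a;'
    entry 2222222222222222222222222222222222222222 'Bob Example' '' 'int b;'
    entry 2222222222222222222222222222222222222222 'Bob Example' '' 'int c;'
    entry 3333333333333333333333333333333333333333 'Carol Example' '' 'int d;'
  } | count_authors
)
printf '%s\n' "$ranking"
```

Against a real repository the function would presumably sit at the end of something like git diff -z --name-only HEAD~5..HEAD | xargs -0 -n1 -- git blame --line-porcelain HEAD~5..HEAD -- | count_authors, which sidesteps the name-parsing regexes entirely.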

Jon Gjengset answered Nov 15 '22 05:11
