I'm looking for a way to compute a good edit distance between the contents of any two commits.
The best I've found is to derive something from the output of
git diff <commit-ish> <commit-ish> --numstat
...but anything I can come up with using this method would be a very crude proxy for edit distance.
Is there anything better?
I think your best bet here is to use an outside tool to calculate the Levenshtein distance, such as Perl's Text::Levenshtein module.
For example, somewhat hackily:
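#!/bin/sh
# Usage: lev_dist_git_commits.sh <commit-ish> <commit-ish>
COMMIT_ONE=$1
COMMIT_TWO=$2
# Paths that differ between the two commits, one per line.
FILES_AFFECTED=$(git diff --no-renames --name-only "$COMMIT_ONE" "$COMMIT_TWO")
TOTAL_LEV_DIST=0
# Read the paths line by line so that names containing spaces survive.
while IFS= read -r FILE; do
    [ -n "$FILE" ] || continue
    # A file that is missing from one of the commits is treated as empty.
    CONTENTS_ONE=$(git show "$COMMIT_ONE:$FILE" 2>/dev/null)
    CONTENTS_TWO=$(git show "$COMMIT_TWO:$FILE" 2>/dev/null)
    LEV_DIST=$(perl -MText::Levenshtein -e 'my ($str1, $str2) = @ARGV; print Text::Levenshtein::distance($str1, $str2);' "$CONTENTS_ONE" "$CONTENTS_TWO")
    TOTAL_LEV_DIST=$((TOTAL_LEV_DIST + LEV_DIST))
done <<EOF
$FILES_AFFECTED
EOF
echo "$TOTAL_LEV_DIST"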
Which seems to do the trick:
$ git diff HEAD HEAD~3 --numstat
0 5 Changes
1 3 dist.ini
$ ./lev_dist_git_commits.sh HEAD HEAD~3
230
$ ./lev_dist_git_commits.sh HEAD HEAD
0
Note: If speed is important and you have a C compiler, you can install Text::Levenshtein::XS for a speed boost. On my computer that reduced the time from 1.5s to 0.05s.
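For comparison, the crude --numstat proxy mentioned in the question could be sketched roughly as below. This is only an illustration, not part of the approach above: numstat_total is a hypothetical helper name, and it counts changed lines rather than characters, so its result is not comparable in scale to a Levenshtein distance.

```shell
#!/bin/sh
# Hypothetical helper: sum the "added" and "deleted" columns of
# `git diff --numstat` output read from stdin.
# Binary files are reported by git as "-" in both columns and are skipped here.
numstat_total() {
    awk '$1 != "-" { total += $1 + $2 } END { print total + 0 }'
}

# Example usage (inside a git repository):
#   git diff --numstat HEAD HEAD~3 | numstat_total
```

For the two files in the example output above, this would add up 0 + 5 + 1 + 3 changed lines, a much coarser figure than the character-level distance.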