Finding where source has branched from git

I have a git repository (covering more or less the project history) and a separate set of sources (just a tarball with a few files) which forked from it some time ago (somewhere in 2004 or 2005).

The sources in the tarball have undergone quite a lot of changes, some of which I'd like to incorporate. The question is how to find the actual branch point of the changed sources, so that I get a minimal diff of what has happened there.

So what I basically want is to find the place in the git history where the code is most similar to the tarball of sources I have. And I don't want to do that manually.

It is also worth mentioning that the changed sources include only a subset of the files and have split some files into several. However, the code in there seems to have received only small modifications and several additions.

If you want to play with that yourself, the tarball with sources is here and Git is hosted at Gitorious: git://gitorious.org/gammu/mainline.git

Michal Čihař asked Jun 23 '10 15:06


2 Answers

In the general case, you'd actually have to examine every single commit, because you have no way of knowing if you might have a huge diff in one, small diff the next, then another huge diff, then a medium diff...

Your best bet is probably going to be to limit yourself to specific files. If you consider just a single file, it should not take long to iterate through all the versions of that file (use git rev-list HEAD -- <path> to get a list, so you don't have to test every commit). For each commit which modified the file, you can check the size of the diff, and fairly quickly find a minimum. Do this for a handful of files; hopefully they'll agree!

The best way to set yourself up for the diffing is to make a temporary commit by simply copying in your tarball, so you can have a branch called tarball to compare against. That way, you could do this:

git rev-list HEAD -- path/to/file | while read hash; do echo "$hash $(git diff --numstat tarball $hash -- path/to/file)"; done

to get a nice list of all the commits with their diff sizes (the first three columns will be SHA1, number of lines added, and number of lines removed). Then you could just pipe it on into awk '{print $1,$2+$3}' | sort -n -k 2, and you'd have a sorted list of commits and their diff sizes!
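Putting the pieces above together, here is a rough sketch as two shell functions. The function names are made up for illustration. It assumes you run it from the top of the work tree, that the tarball unpacks to the same relative paths as the repository, and that the files of interest are text (binary files count as a diff size of 0 here).

```shell
# Hypothetical helper: commit the unpacked tarball on a temporary branch
# named "tarball", then switch back to the branch you were on.
# Assumes the archive lives outside the work tree and its paths match the repo layout.
make_tarball_branch() {
    git checkout -q -b tarball
    tar -xf "$1"
    git add -A
    git commit -q -m "Temporary import of tarball sources"
    git checkout -q -
}

# Hypothetical helper: for every commit that touched a file, print
# "<sha1> <lines added + lines removed>" relative to the tarball branch,
# smallest diff first.
rank_commits() {
    path=$1
    git rev-list HEAD -- "$path" | while read -r hash; do
        # --numstat prints "added<TAB>removed<TAB>path"; no output means identical
        git diff --numstat tarball "$hash" -- "$path" |
            awk -v h="$hash" '{ n += $1 + $2 } END { print h, n + 0 }'
    done | sort -n -k 2
}
```

Something like make_tarball_branch /tmp/sources.tar followed by rank_commits path/to/file | head would then show the closest candidates for that file. Delete the temporary branch with git branch -D tarball when you're done.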

If you can't limit yourself to a small handful of files to test, I might be tempted to hand-implement something similar to git bisect: just try to narrow your way down to a small diff, making the assumption that in all likelihood, commits near your best case will also have smaller diffs, and commits far from it will have larger diffs. (Somewhere between Newton's method and a full-on binary/grid search, probably?)

Edit: Another possibility, suggested in Douglas' answer, if you think that some files might be identical to those in some commit, is to hash them using git hash-object, and then see which commits in your history have that blob. There's a question with some excellent answers about how to do that. If you do this with a handful of files - preferably ones which have changed frequently - you might be able to narrow down the target commit pretty quickly.
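For a single file, the blob lookup could be sketched like this. The function name is made up, and this is the brute-force way (it lists the full tree of every commit), so it may be slow on a long history.

```shell
# Hypothetical helper: print every commit whose tree contains a blob
# byte-identical to the given file (under any path).
commits_containing() {
    blob=$(git hash-object "$1") || return
    git rev-list --all | while read -r commit; do
        # ls-tree -r lists "mode type sha1<TAB>path" for every file in the commit
        if git ls-tree -r "$commit" | grep -q "$blob"; then
            echo "$commit"
        fi
    done
}
```

Run it from inside the repository, e.g. commits_containing /path/to/unpacked/file. The oldest and newest commits it prints bracket the period during which that exact version of the file existed.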

Cascabel answered Oct 14 '22 19:10


Not a great solution, but to get a guess of which revisions it might be: assume that some of the files in the tarball have not been changed since they were branched. Run git hash-object against each file in the tarball, then search for those files in the repository using git show. Then try to find the commits under which these files were included, possibly using git whatchanged. The answer to your question might then be the commit with the most files in common, but it'll still be a bit hit and miss.
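A rough sketch of that tally, as one way to do it: it uses git rev-list and git ls-tree rather than git show and git whatchanged, and the function name is made up. It hashes every file under a directory holding the unpacked tarball and counts, per commit, how many of those blobs appear in the commit's tree.

```shell
# Hypothetical helper: for each commit, count how many distinct tarball files
# exist byte-identically in that commit; prints "<count> <sha1>", best first.
rank_by_identical_files() {
    dir=$1
    hashes=$(mktemp)
    # One blob hash per tarball file
    find "$dir" -type f -exec git hash-object {} + | sort -u > "$hashes"
    git rev-list --all | while read -r commit; do
        # Third column of ls-tree output is the blob hash
        n=$(git ls-tree -r "$commit" | awk '{ print $3 }' | sort -u |
            grep -c -F -x -f "$hashes")
        echo "$n $commit"
    done | sort -rn
    rm -f "$hashes"
}
```

Run it from inside the repository, e.g. rank_by_identical_files /path/to/unpacked/tarball | head. Ties are likely, since a run of consecutive commits can share the same unchanged files.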

Douglas answered Oct 14 '22 20:10