
efficiently rewriting (rebase -i) a lot of history with git

I have a git repository with about 3500 commits and 30,000 distinct files in the latest revision. It represents about 3 years of work from multiple people and we have received permission to make it all open-source. I am trying hard to release the entire history, instead of just the latest version. To do this I am interested in "going back in time" and inserting a license header at the top of files when they are created. I actually have this working, but it takes about 3 days running entirely out of a ramdisk, and still does require a little bit of manual intervention. I know it can be a lot faster, but my git-fu is not quite up to the task.

The question: how can I accomplish the same thing a lot faster?

What I currently do (automated in a script, but please bear with me...):

  1. Identify all of the commits where a new file was added to the repository (there are just shy of 500 of these, fwiw):

    git whatchanged --diff-filter=A --format=oneline
    
  2. Define the environment variable GIT_EDITOR to be my own script that replaces pick with edit, but only on the first line of the file (you will see why shortly). This is the core of the operation:

    perl -pi -e 's/pick/edit/ if $. == 1' $1
    
  3. For each commit from the output of git whatchanged above, invoke an interactive rebase starting just before the commit that added the file:

    git rebase -i decafbad001badc0da0000~1
    

My custom GIT_EDITOR (that perl one-liner) changes pick to edit, and we are dropped to a shell to make changes to the new file. Another simple header-inserter script looks for a known unique pattern in the header that I'm trying to insert, but only in known file types (*.[chS] for me). If the header isn't there, the script inserts it and git add's the file. This naive technique has no knowledge of which files were actually added during the present commit, but it ends up doing the right thing and being idempotent (safe to run multiple times against the same file), and it is not where this whole process is bottlenecked anyway.
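
For concreteness, an idempotent insert-and-add step like the one described might look roughly like this (the marker string, header file, and path are placeholders, not my actual script):

    # hypothetical: prepend the header only if the unique marker is missing, then stage
    f=src/foo.c
    if ! grep -q 'KNOWN-UNIQUE-LICENSE-MARKER' "$f"; then
        cat license-header.txt "$f" > "$f.tmp" && mv "$f.tmp" "$f"
        git add "$f"
    fi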

At this point we're happy that we've updated the current commit, and invoke:

    git commit --amend
    git rebase --continue
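
Pulling steps 1-3 together, the driver loop is roughly the following (a simplified sketch with placeholder paths, not my actual script; conflict handling is omitted):

    # GIT_EDITOR is the step-2 one-liner wrapped in a tiny script
    export GIT_EDITOR=/path/to/pick-to-edit.sh

    # step 1 produces the commit list; step 3 runs one interactive rebase per commit.
    # Going newest-to-oldest keeps the remaining (older) hashes valid, because each
    # rebase only rewrites the edited commit and its descendants.
    git whatchanged --diff-filter=A --format=oneline \
      | awk '!/^:/ && NF { print $1 }' \
      | while read -r sha; do
            git rebase -i "${sha}~1"        # stops at the commit marked "edit"
            /path/to/insert-header.sh       # the header-inserter described above
            git commit --amend --no-edit
            git rebase --continue           # the expensive part
        done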

The rebase --continue is the expensive part. Since we invoke a git rebase -i once for every revision in the output of whatchanged, that's a lot of rebasing. Almost all of the time during which this script runs is spent watching the "Rebasing (2345/2733)" counter increment.

It's also not just slow. Conflicts periodically arise and must be addressed. This can happen in at least these cases (but likely more):

  1. when a "new" file is actually a copy of an existing file with some changes made to its very first lines (e.g., #include statements). This is a genuine conflict but can be resolved automatically in most cases (yep, I have a script that deals with that).

  2. when a file is deleted. This is trivially resolvable by just confirming that we want to delete it with git rm, as shown below.

  3. there are some places where it seems like diff just behaves badly, e.g., where the change is only the addition of a blank line.

Other, more legitimate conflicts require manual intervention, but on the whole they are not the biggest bottleneck. The biggest bottleneck is absolutely just sitting there staring at "Rebasing (xxxx/yyyy)".
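
For case 2, for example, the resolution amounts to something like (the path is a placeholder):

    # the replayed commit deletes a file; confirm the deletion and move on
    git rm path/to/deleted-file.c
    git rebase --continue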

Right now the individual rebases are initiated from newer commits to older commits, i.e., starting from the top of the output of git whatchanged. This means that the very first rebase affects yesterday's commits, and that eventually we'll be rebasing commits from 3 years ago. Going from "newer" to "older" seems counter-intuitive, but so far I'm not convinced that it matters unless we change more than one pick to an edit when invoking the rebase. I am afraid to do this because conflicts do arise, and I don't want to deal with a tidal wave of conflict ripples from trying to rebase everything in one go. Maybe somebody knows a way to avoid that? I haven't been able to come up with one.

I started looking at the internal workings of git objects! It does seem like there should be a much more efficient way to walk the object graph and just make the changes that I want to make.
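
For what it's worth, here is a rough plumbing-level sketch of rewriting a single commit without touching a work tree (the hash and path are placeholders, and it ignores author/date preservation); the hard part is that every descendant commit then needs its parent rewritten as well, which is exactly what rebase and filter-branch automate:

    OLD=decafbad001badc0da0000                    # placeholder: the commit that added the file
    git read-tree "$OLD^{tree}"                   # load that commit's tree into the index
    BLOB=$(git show "$OLD:src/foo.c" | cat license-header.txt - | git hash-object -w --stdin)
    git update-index --cacheinfo 100644 "$BLOB" src/foo.c
    TREE=$(git write-tree)
    NEW=$(git commit-tree "$TREE" -p "$OLD~1" -m "$(git log -1 --format=%B "$OLD")")
    echo "$NEW"                                   # replacement commit; all descendants still point at $OLD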

Please note that this repository came from an SVN repository where we effectively made no use of tags or branches (I already git filter-branched them away), so we do have the convenience of a straight-line history. No git branches or merges.

I'm sure I've left out some critical information, but this post already seems excessively long. I will do my best to provide more information as requested, and in the end I may need to just publish my various scripts. My objective is to figure out how to rewrite history in a git repository this way, not to debate other viable methods of licensing and code release.

Thanks!

Update 2012-06-17: Blog post with all the gory details.

asked Jun 06 '12 by jonny0x5
1 Answer

Using

git filter-branch -f --tree-filter 'if [ -f README ]; then echo "---FOOTER---" >> README; fi' HEAD

would essentially add a footer line to the README file, and the history would look like it has been there since the file was created. I'm not sure whether it will be efficient enough for you, but it is the correct way to do it.
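
As a quick sanity check after the rewrite, you could confirm the footer shows up even in the oldest history, for example (assuming README already existed at the root commit):

    git show "$(git rev-list --max-parents=0 HEAD)":README | tail -n 1   # should print ---FOOTER---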

Craft a custom script and you'll probably end up with a good project history; doing too much "magic" (rebase, perl, scripted editors, etc.) may end up losing or changing project history in unexpected ways.

jon (the OP) used this basic pattern to achieve the goal, with significant simplification and speedup:

git filter-branch -d /dev/shm/git --tree-filter \
'perl /path/to/find-add-license.pl' --prune-empty HEAD

A few performance-critical observations:

  • Using the -d <directory> parameter pointing to a ramdisk directory (like /dev/shm/foo) will improve the speed significantly.

  • Do all of the changes from a single script using its built-in language features; the forks incurred by running small utilities (like find) once per file will slow the process down many times over. Avoid this:

    git filter-branch -d /dev/shm/git --tree-filter \
    'find . -name "*.[chS]" -exec perl /path/to/just-add-license.pl \{\} \;' \
    --prune-empty HEAD
    

This is a sanitized version of the perl script the OP used:

#!/usr/bin/perl -w
use strict;
use File::Slurp;
use File::Find;

# directories to process (placeholders) and the header text to prepend
my @dirs = qw(aDir anotherDir nested/DIR);
my $header = "Please put me at the top of each file.\n"; # trailing newline keeps the header on its own line

foreach my $dir (@dirs) {
  if (-d $dir) {
    find(\&Wanted, $dir); # calls Wanted for every file under $dir
  }
}

sub Wanted {
  /\.c$|\.h$|\.S$/ or return; # only *.[chS] files
  my $file = $_;
  my $contents = read_file($file);
  $contents =~ s/\r\n?/\n/g; # convert DOS or old-Mac line endings to Unix
  unless ($contents =~ /Please put me at the top of each file\./) { # idempotence check
    write_file( $file, {atomic => 1}, $header, $contents ); # prepend the header
  }
}
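
One way to spot-check the result after running the filter is to look at a file's contents in the very commit that added it (the path is a placeholder):

    first=$(git log --diff-filter=A --format=%H -- src/foo.c)   # the commit that added the file
    git show "$first:src/foo.c" | head -n 3                     # should start with the license header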
answered Sep 19 '22 by KurzedMetal