Any way to use filter-branch in an incremental sense

Question

Is there any way to use filter-branch in an incremental manner on a branch?

Roughly speaking, like this (but this isn't actually working):

git checkout -b branchA origin/branchA
git branch headBranchA
# initial rewrite
git filter-branch ... -- branchA
git fetch origin
# incremental rewrite
git filter-branch ... -- headBranchA..origin/branchA
git merge origin/branchA

torek · Accepted Answer

I'm not sure what you're really trying to achieve, so what I will say here is "yes, sort of, but probably not what you're thinking and it might not help you achieve your goal, whatever that is".

It's important to understand here not just what filter-branch does, but also, to some extent, how it does it.

Background (to make this answer useful to others)

A git repository contains some commit-graph(s). These are found by taking some starting commit nodes, found via external references—mostly branch and tag names, but also annotated tags which I'll just sort of gloss over as not particularly important to this case—and then using those starting nodes to find more nodes, until all "reachable" nodes have been found.

Each commit has zero or more "parent commits". Most ordinary commits have one parent; merges have two or more parents. A root commit (such as the initial commit in a repository) has no parents.

Branch names point to one particular commit, which points back to its parent(s), and so on.

  B-C-D
 /     \
A---E---F   <-- master
 \
  G     J   <-- branch1
   \   /
    H-I-K   <-- branch2

Branch name master points to commit F (which is a merge commit). The names branch1 and branch2 point to commits J and K respectively.

Let's also note that, because commits point to their parents, the "reachable set" from name master is A B C D E F, the set for branch1 is A G H I J, and the set for branch2 is A G H I K.

The "true name" of each commit node is its SHA-1, which is a cryptographic checksum of the contents of the commit. The contents include SHA-1 checksums of the corresponding work-tree contents and the SHA-1s of the parent commits. Thus, if you go to copy a commit and change nothing (not one single bit) you get the same SHA-1 back and hence wind up with the same commit; but if you change even a single bit (including, e.g., changing the spelling of the committer's name, any time stamps, or any part of the associated work-tree), you get a new, different commit.
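To see the "same bits, same SHA-1" rule in action, here is a minimal sketch. It assumes a throwaway repository, with made-up identities and a fixed timestamp so that every input to the hash is pinned; none of these names come from the question.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Pin everything that feeds the commit hash (illustrative identities and dates).
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='A U Thor' GIT_COMMITTER_EMAIL='author@example.com'
export GIT_AUTHOR_DATE='2000-01-01 00:00:00 +0000'
export GIT_COMMITTER_DATE='2000-01-01 00:00:00 +0000'

tree=$(git mktree </dev/null)    # writes the empty tree and prints its ID

# Two commits built from identical contents get the identical ID ...
first=$(git commit-tree -m 'same' "$tree")
second=$(git commit-tree -m 'same' "$tree")

# ... while respelling the author's name alone produces a different commit.
respelled=$(GIT_AUTHOR_NAME='A. U. Thor' git commit-tree -m 'same' "$tree")

echo "copy 1:    $first"
echo "copy 2:    $second"
echo "respelled: $respelled"
```

The first two IDs match and the third differs, which is exactly why a filter that touches any part of a commit forces a new commit ID.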

`git rev-parse` and `git rev-list`

These two commands are central to most git operations.

The rev-parse command turns any valid git revision specifier into a commit-ID. (It also has a lot of what we might call "assistance modes", that allow writing most git commands as shell scripts—and git filter-branch is in fact a shell script.)

The rev-list command turns a revision range (also described in gitrevisions) into a list of commit-IDs. Given just a branch name, it finds the set of all revisions reachable from that branch, so with the example commit graph above, given branch2, it lists the SHA-1 values for commits A, G, H, I, and K. (It defaults to listing them in reverse chronological order, but can be told to list them in topological order with --topo-order, which is important to filter-branch, not that I intend to get that deep into the details here.)

In this case, though, you will want to use "commit limiting": given a revision range, like the A..B syntax, or the equivalent B ^A, git rev-list limits its output to commits that are reachable from B but not reachable from A. Hence, given branch2~3..branch2 (or equivalently, branch2 ^branch2~3) it lists the SHA-1 values for H, I, and K. This is because branch2~3 names commit G, so commits A and G are pruned away from the reachable set.
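Here is a small sketch of that commit limiting. It builds only the A-G-H-I-K chain of the example graph (the other branches are left out for brevity) in a scratch repository, using empty commits whose subjects are the node letters; the identities are made up.

```shell
set -e
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='A U Thor' GIT_COMMITTER_EMAIL='author@example.com'

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/branch2   # name the unborn branch branch2

# One empty commit per node on the path from the root to branch2.
for node in A G H I K; do
    git commit -q --allow-empty -m "$node"
done

# branch2~3 is commit G, so the range excludes G and its ancestor A.
excluded=$(git log -1 --format=%s branch2~3)
range=$(git log --reverse --format=%s branch2~3..branch2 | xargs)
caret=$(git log --reverse --format=%s branch2 ^branch2~3 | xargs)

echo "excluded tip: $excluded"
echo "A..B form:    $range"
echo "B ^A form:    $caret"
```

Both spellings select the same three commits (H, I, and K), listed oldest first here because of --reverse.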

`git filter-branch`

The filter-branch script is fairly complex but summarizing its action on "ref names given on the command line" is not too hard.

First, it uses git rev-parse to find the actual head revisions of the branch or branches to be filtered. It uses it twice, in fact: once to get SHA-1 values, and once to get names. Given, e.g., headBranchA..origin/branchA, it needs to get the "true full name" refs/remotes/origin/branchA:

git rev-parse --revs-only --symbolic-full-name headBranchA..origin/branchA

will print:

refs/remotes/origin/branchA
^refs/heads/headBranchA

The filter-branch script discards any ^-prefixed results to get a list of "positive ref names"; these are what it intends to rewrite, in the end.

These are the "positive refs" described in the git-filter-branch manual.
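A rough sketch of that ref-name step, runnable in a scratch repository. The remote-tracking ref is faked with git update-ref instead of a real fetch, and the sed line only mimics the "discard ^-prefixed results" behavior described above; it is not a copy of the script itself.

```shell
set -e
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='A U Thor' GIT_COMMITTER_EMAIL='author@example.com'

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/headBranchA
git commit -q --allow-empty -m o

# Stand-in for what "git fetch origin" would normally create.
git update-ref refs/remotes/origin/branchA HEAD

# Ask for full ref names; the excluded side comes back with a ^ prefix.
raw=$(git rev-parse --revs-only --symbolic-full-name headBranchA..origin/branchA)

# Keep only the "positive" names by dropping lines that start with ^.
positive=$(printf '%s\n' "$raw" | sed -e '/^\^/d')

echo "raw:"
printf '%s\n' "$raw"
echo "positive: $positive"
```

The one surviving name, refs/remotes/origin/branchA, is the ref that filter-branch would move at the end of the run.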

Then it uses git rev-list to get a complete list of commit SHA-1s on which to apply the filters. This is where the headBranchA..origin/branchA limiting syntax comes in: the script now knows to work only on commits reachable from origin/branchA, but not from headBranchA.

Once it has the list of commit IDs, git filter-branch actually applies the filters. These make new commits.

As always, if the new commits are exactly identical to the original commits, the commit-IDs are unchanged. If filter-branch is to be useful, though, presumably at some point, some commits are changed, giving them new SHA-1s. Any immediate children of those commits have to acquire new parent IDs, so those commits are also changed, and those changes propagate down to the ultimate branch-tips.

Finally, having applied the filters to all the listed commits, the filter-branch script updates the "positive refs".

The next part depends on your actual filters. Let's just assume for illustration that your filter changes the spelling of an author name on every commit, or changes the time-stamp on every commit, or some such, so that every commit is rewritten, except for some reason it leaves the root commit unchanged, so that the new branch and the old one do have a common ancestor.

We start with this:

git checkout -b branchA origin/branchA

(you are now on branchA, i.e., HEAD contains ref: refs/heads/branchA)

git branch headBranchA

(this makes another branch label pointing to the current HEAD commit but does not alter HEAD)
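If you want to check those two parenthetical claims yourself, something like the following should do. It is only a sketch: origin/branchA is faked with git update-ref, the starting branch name scratch is made up, and --no-track is added because there is no real remote to track.

```shell
set -e
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='A U Thor' GIT_COMMITTER_EMAIL='author@example.com'

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/scratch          # hypothetical starting branch
git commit -q --allow-empty -m o
git update-ref refs/remotes/origin/branchA HEAD   # stand-in for a fetched ref

git checkout -q --no-track -b branchA origin/branchA
before=$(git symbolic-ref HEAD)    # what HEAD names right after the checkout

git branch headBranchA             # new label only; HEAD is left alone
after=$(git symbolic-ref HEAD)

echo "HEAD before: $before"
echo "HEAD after:  $after"
echo "headBranchA: $(git rev-parse headBranchA)"
echo "HEAD commit: $(git rev-parse HEAD)"
```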

# initial rewrite
git filter-branch ... -- branchA

The "positive ref" in this case is branchA. The commits to be rewritten are every commit reachable from branchA, i.e., all the o nodes below (starting commit graph made up for illustration here), except for the root commit R:

R-o-o-x-x-x   <-- master
     \
      o-o-o   <-- headBranchA, HEAD=branchA, origin/branchA

Every o commit is copied, and branchA is moved to point to the last new one:

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- headBranchA, origin/branchA
 \
  *-*-*-*-*   <-- HEAD=branchA

Later, you go to pick up new stuff from remote origin:

git fetch origin

Let's say this adds commits labeled n (and I'll just add one):

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- headBranchA
|          \
|           n <-- origin/branchA
 \
  *-*-*-*-*   <-- HEAD=branchA

Here's where things go wrong:

git filter-branch ... -- headBranchA..origin/branchA

The "positive ref" here is origin/branchA, so that's what will be moved. The commits selected by the rev-list are just those marked n, which is what you want. Let's spell the rewritten commit N (uppercase) this time:

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- headBranchA
|         |\
|         | n [semi-abandoned - filter-branch writes refs/original/...]
|          \
|           N <-- origin/branchA
 \
  *-*-*-*-*   <-- HEAD=branchA

And now you attempt to git merge origin/branchA, which means to git merge commit N, which requires finding the merge base between the * chain and commit N ... and that's commit R.

This is not, I assume, what you meant to do at all.
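The merge-base claim can be reproduced in a scratch repository. In this sketch the filter (respelling the author name), the tag R, and the identities are all illustrative. The fetch is simulated with commit-tree plus update-ref, and the range R..branchA is used only to mimic the assumption that the root commit stays unchanged.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's warning and delay
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='A U Thor' GIT_COMMITTER_EMAIL='author@example.com'

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/branchA
git commit -q --allow-empty -m R
git tag R                                # remember the root commit
git commit -q --allow-empty -m o1
git commit -q --allow-empty -m o2
git branch headBranchA
git update-ref refs/remotes/origin/branchA HEAD   # stand-in for the remote ref

# Initial rewrite: every commit after R gets a new author, hence a new ID.
git filter-branch --env-filter \
    'GIT_AUTHOR_NAME="Fixed Name"; export GIT_AUTHOR_NAME' \
    -- R..branchA >/dev/null

# Simulate "git fetch origin" delivering one new commit n on the old tip.
n=$(git commit-tree -p origin/branchA -m n \
    "$(git rev-parse 'origin/branchA^{tree}')")
git update-ref refs/remotes/origin/branchA "$n"

# The rewritten chain and the upstream chain only meet at the root.
base=$(git merge-base branchA origin/branchA)
root=$(git rev-parse R)
echo "merge base: $base"
echo "root R:     $root"
```

Since the merge base is R, a merge would try to combine the whole unfiltered history with the whole filtered one.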

I suspect what you want to do is, instead, cherry-pick commit N onto the * chain. Let's draw that in:

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- headBranchA
|         |\
|         | n [semi-abandoned - filter-branch writes refs/original/...]
|          \
|           N <-- origin/branchA
 \
  *-*-*-*-*-N' <-- HEAD=branchA

This part is OK, but it's left a mess for the future. It turns out you don't actually want commit N at all, and you don't want to move origin/branchA, because (I assume) you'd like to be able to repeat the git fetch origin step later. So let's "undo" this and try something different. Let's drop the headBranchA label entirely and start with this:

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- origin/branchA
 \
  *-*-*-*-*   <-- HEAD=branchA

Let's add a temporary marker for the commit to which origin/branchA points, and run git fetch origin, so that we get commit n:

R-o-o-x-x-x     <-- master
|    \     .--------temp
|     o-o-o-n   <-- origin/branchA
 \
  *-*-*-*-*     <-- HEAD=branchA

Now let's copy commit n to branchA, and while we're copying it, modify it too (doing whatever mods you would do with git filter-branch) to get a commit we'll just call N:

R-o-o-x-x-x     <-- master
|    \     .--------temp
|     o-o-o-n   <-- origin/branchA
 \
  *-*-*-*-*-N    <-- HEAD=branchA

When this is done we erase temp and we're ready to repeat the cycle.

Making it work

That leaves several problems. The most obvious is: how do we copy n (or several/many ns) and then modify them? Well, the easy way, assuming you have your filter-branch already working, is to use git cherry-pick to copy them, then git filter-branch to filter them.

This only works if the cherry-pick step is not going to run into tree-difference issues, so it depends on what your filter does:

# all of this to be done while on branchA
git tag temp origin/branchA
git fetch origin # pick up `n` commit(s)

git tag temp2    # mark the point for filtering
git cherry-pick temp..origin/branchA
git filter-branch ... -- temp2..branchA

# remove temporary markers
git tag -d temp temp2
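For a concrete run of this recipe, here is a sketch that uses two throwaway repositories, one playing origin and one playing the local clone. The filter (respelling the author name), the file name, and the identities are made up. One assumption goes beyond the commands above: the second filter-branch run is given -f, because the first run leaves a backup under refs/original/ and filter-branch refuses to overwrite it otherwise.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's warning and delay
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='A U Thor' GIT_COMMITTER_EMAIL='author@example.com'

work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
git symbolic-ref HEAD refs/heads/branchA
echo one >file;  git add file; git commit -q -m o1
echo two >>file; git commit -q -am o2

cd "$work"
git clone -q origin local
cd local                                 # now on branchA, tracking origin

# Initial rewrite of the whole branch (illustrative filter).
filter='GIT_AUTHOR_NAME="Fixed Name"; export GIT_AUTHOR_NAME'
git filter-branch --env-filter "$filter" -- branchA >/dev/null

# Upstream gains one new commit n.
(cd "$work/origin" && echo three >>file && git commit -q -am n)

# Incremental step, following the recipe.
git tag temp origin/branchA
git fetch -q origin                      # pick up n
git tag temp2                            # mark the point for filtering
git cherry-pick temp..origin/branchA >/dev/null
git filter-branch -f --env-filter "$filter" -- temp2..branchA >/dev/null
git tag -d temp temp2 >/dev/null

commit_count=$(git rev-list --count branchA)
authors=$(git log --format=%an branchA | sort -u)
echo "$commit_count commits, authors: $authors"
```

After the incremental step, branchA holds three commits that all carry the filtered author name, and origin/branchA still points at the unfiltered upstream tip, ready for the next fetch.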

What if your filter-branch alters the tree, so that this method won't always work? Well, we can resort to applying the filter directly to the n commits, giving n' commits, then copy the n' commits. Those (n'') commits are the ones that will live on the local (filtered) branchA. The n' commits are not needed once they've been copied, so we discard them.

# lay down temporary marker as before, and fetch
git tag temp origin/branchA
git fetch origin

# now make a new branch, just for filtering
git checkout -b temp2 origin/branchA
git filter-branch ... -- temp..temp2
# the now-altered new branch, temp..temp2, has filtered commits n'

# copy n' commits to n'' commits on branchA
git checkout branchA
git cherry-pick temp..temp2

# and finally, delete the temporary marker and the temporary branch
git tag -d temp
git branch -D temp2 # temp2 requires a force-delete
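Here is a sketch of this second variant with a tree-altering filter. The filter (an index-filter that drops a file named secret.txt), the file names, and the identities are all made up for illustration. As an added assumption, the second filter-branch run uses -f to get past the refs/original/ backup left by the first run.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's warning and delay
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='A U Thor' GIT_COMMITTER_EMAIL='author@example.com'

work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
git symbolic-ref HEAD refs/heads/branchA
echo one >file; echo s1 >secret.txt
git add file secret.txt
git commit -q -m o1

cd "$work"
git clone -q origin local
cd local

# Initial rewrite: remove secret.txt from every commit (illustrative filter).
drop_secret='git rm -q --cached --ignore-unmatch secret.txt'
git filter-branch --index-filter "$drop_secret" -- branchA >/dev/null

# Upstream commit n touches both the kept file and the filtered-out one.
(cd "$work/origin" && echo two >>file && echo s2 >>secret.txt &&
    git commit -q -am n)

# Lay down the temporary marker and fetch.
git tag temp origin/branchA
git fetch -q origin

# Filter n on a scratch branch to get n'.
git checkout -q -b temp2 origin/branchA
git filter-branch -f --index-filter "$drop_secret" -- temp..temp2 >/dev/null

# Copy n' onto the filtered branch as n''.
git checkout -q branchA
git cherry-pick temp..temp2 >/dev/null

# Clean up the marker and the scratch branch.
git tag -d temp >/dev/null
git branch -q -D temp2

commit_count=$(git rev-list --count branchA)
secret_history=$(git log --format=%H branchA -- secret.txt)
echo "$commit_count commits; touching secret.txt: ${secret_history:-none}"
```

Cherry-picking the unfiltered n directly would hit a modify/delete conflict on secret.txt here. Filtering first sidesteps that, because n' no longer contains the file at all.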
