
Any way to use filter-branch in an incremental sense

Is there any way to use filter-branch in an incremental manner on a branch?

Roughly speaking, like this (though this doesn't actually work):

git checkout -b branchA origin/branchA
git branch headBranchA
# initial rewrite
git filter-branch ... -- branchA
git fetch origin
# incremental rewrite
git filter-branch ... -- headBranchA..origin/branchA
git merge origin/branchA
Gert asked Dec 11 '22 07:12


2 Answers

I'm not sure what you're really trying to achieve, so what I will say here is "yes, sort of, but probably not what you're thinking and it might not help you achieve your goal, whatever that is".

It's important to understand here not just what filter-branch does, but also, to some extent, how it does it.


Background (to make this answer useful to others)

A git repository contains some commit-graph(s). These are found by taking some starting commit nodes, found via external references—mostly branch and tag names, but also annotated tags which I'll just sort of gloss over as not particularly important to this case—and then using those starting nodes to find more nodes, until all "reachable" nodes have been found.

Each commit has zero or more "parent commits". Most ordinary commits have one parent; merges have two or more parents. A root commit (such as the initial commit in a repository) has no parents.

Branch names point to one particular commit, which points back to its parent(s), and so on.

  B-C-D
 /     \
A---E---F   <-- master
 \
  G     J   <-- branch1
   \   /
    H-I-K   <-- branch2

Branch name master points to commit F (which is a merge commit). The names branch1 and branch2 point to commits J and K respectively.

Let's also note that, because commits point to their parents, the "reachable set" from name master is A B C D E F, the set for branch1 is A G H I J, and the set for branch2 is A G H I K.

The "true name" of each commit node is its SHA-1, which is a cryptographic checksum of the contents of the commit. The contents include SHA-1 checksums of the corresponding work-tree contents and the SHA-1s of the parent commits. Thus, if you go to copy a commit and change nothing (not one single bit) you get the same SHA-1 back and hence wind up with the same commit; but if you change even a single bit (including, e.g., changing the spelling of the committer's name, any time stamps, or any part of the associated work-tree), you get a new, different commit.
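To see the "same bits in, same SHA-1 out" rule in action, here is a minimal sketch that builds commit objects by hand in a throwaway repository. The names, e-mail address and timestamp are made up for illustration; the only point is that they are pinned, so nothing varies between the first two commits.

```shell
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q

# Pin every input that goes into the commit object (all values are made up).
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='A U Thor' GIT_COMMITTER_EMAIL='author@example.com'
export GIT_AUTHOR_DATE='1000000000 +0000' GIT_COMMITTER_DATE='1000000000 +0000'

tree=$(git write-tree)                     # tree object for the (empty) index

c1=$(git commit-tree "$tree" -m 'same')    # a root commit with no parents
c2=$(git commit-tree "$tree" -m 'same')    # identical input, bit for bit
c3=$(GIT_AUTHOR_NAME='A. U. Thor' git commit-tree "$tree" -m 'same')  # one field respelled

echo "c1=$c1"
echo "c2=$c2"
echo "c3=$c3"
```

The first two IDs should match and the third should differ, which is exactly why a filter that touches a commit forces new IDs on that commit and on everything built on top of it.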

git rev-parse and git rev-list

These two commands are quite central to most git operations.

The rev-parse command turns any valid git revision specifier into a commit-ID. (It also has a lot of what we might call "assistance modes", that allow writing most git commands as shell scripts—and git filter-branch is in fact a shell script.)

The rev-list command turns a revision range (also described in gitrevisions) into a list of commit-IDs. Given just a branch name, it finds the set of all revisions reachable from that branch, so with the example commit graph above, given branch2, it lists the SHA-1 values for commits A, G, H, I, and K. (It defaults to listing them in reverse chronological order, but can be told to list them in "topological order", which is important to filter-branch, not that I intend to get that deep into the details here.)

In this case, though, you will want to use "commit limiting": given a revision range, like the A..B syntax, or given things like B ^A, git rev-list limits its output rev-sets to commits that are reachable from B, but not reachable from A. Hence, given branch2~3..branch2 (or equivalently, branch2 ^branch2~3), it lists the SHA-1 values for H, I, and K. This is because branch2~3 names commit G, so commits A and G are pruned away from the reachable set.
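If you want to check the reachable sets yourself, the example graph can be rebuilt in a scratch repository. This is only a sketch: the commits are empty, their subjects double as the node labels from the drawing, and the helper names (node, labels) and the temporary topic branch are my own inventions.

```shell
set -e
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com'
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master    # force the branch name used in the drawing

# Empty commit whose subject is the node label.
node() { git commit -q --allow-empty -m "$1"; }
# Print the labels selected by a revision specifier, sorted and joined.
labels() { git log --format=%s "$@" | sort | tr -d '\n'; }

node A && git tag A
git checkout -q -b topic     && node B && node C && node D
git checkout -q master       && node E
git merge -q --no-ff -m F topic            # F has two parents: E and D
git checkout -q -b branch2 A && node G && node H && node I
git checkout -q -b branch1   && node J     # J sits on top of I
git checkout -q branch2      && node K     # K also sits on top of I
git branch -q -D topic                     # the drawing has no such label

all=$(labels branch2)                      # everything reachable from K
range=$(labels branch2~3..branch2)         # reachable from K but not from G
alt=$(labels branch2 ^branch2~3)           # the same limit, spelled the other way

echo "branch2:            $all"
echo "branch2~3..branch2: $range"
echo "branch2 ^branch2~3: $alt"
```

Both limited forms should select the same three commits, H, I and K, while the unlimited form also includes A and G.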


git filter-branch

The filter-branch script is fairly complex but summarizing its action on "ref names given on the command line" is not too hard.

First, it uses git rev-parse to find the actual head revisions of the branch or branches to be filtered. It uses it twice, in fact: once to get SHA-1 values, and once to get names. Given, e.g., headBranchA..origin/branchA, it needs to get the "true full name" refs/remotes/origin/branchA:

git rev-parse --revs-only --symbolic-full-name headBranchA..origin/branchA

will print:

refs/remotes/origin/branchA
^refs/heads/headBranchA

The filter-branch script discards any ^-prefixed results to get a list of "positive ref names"; these are what it intends to rewrite, in the end.

These are the "positive refs" described in the git-filter-branch manual.
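Here is a small sketch of that name-resolution step in a scratch repository. There is no real remote: the remote-tracking ref is faked with git update-ref, and the grep is my stand-in for whatever the script does internally to drop the ^-prefixed lines.

```shell
set -e
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com'
repo=$(mktemp -d) && cd "$repo" && git init -q

git commit -q --allow-empty -m first
git branch headBranchA
git commit -q --allow-empty -m second
# Stand-in for a fetched remote-tracking ref (no actual remote involved).
git update-ref refs/remotes/origin/branchA HEAD

# Full ref names for both ends of the range; the excluded end carries a ^ prefix.
raw=$(git rev-parse --revs-only --symbolic-full-name headBranchA..origin/branchA)

# Keep only the lines without the ^ prefix: the "positive refs".
positive=$(printf '%s\n' "$raw" | grep -v '^\^')

printf 'raw:\n%s\n' "$raw"
printf 'positive refs:\n%s\n' "$positive"
```

The only positive ref left over is refs/remotes/origin/branchA, which is why the range form in the question ends up rewriting the remote-tracking name rather than the local branch.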

Then it uses git rev-list to get a complete list of commit SHA-1s on which to apply the filters. This is where the headBranchA..origin/branchA limiting syntax comes in: the script now knows to work only on commits reachable from origin/branchA, but not from headBranchA.

Once it has the list of commit IDs, git filter-branch actually applies the filters. These make new commits.

As always, if the new commits are exactly identical to the original commits, the commit-IDs are unchanged. If filter-branch is to be useful, though, presumably at some point, some commits are changed, giving them new SHA-1s. Any immediate children of those commits have to acquire new parent IDs, so those commits are also changed, and those changes propagate down to the ultimate branch-tips.

Finally, having applied the filters to all the listed commits, the filter-branch script updates the "positive refs".


The next part depends on your actual filters. Let's just assume for illustration that your filter changes the spelling of an author name on every commit, or changes the time-stamp on every commit, or some such, so that every commit is rewritten, except for some reason it leaves the root commit unchanged, so that the new branch and the old one do have a common ancestor.

We start with this:

git checkout -b branchA origin/branchA

(you are now on branchA, i.e., HEAD contains ref: refs/heads/branchA)

git branch headBranchA

(this makes another branch label pointing to the current HEAD commit but does not alter HEAD)

# initial rewrite
git filter-branch ... -- branchA

The "positive ref" in this case is branchA. The commits to be rewritten are every commit reachable from branchA, i.e., all the o nodes below (starting commit graph made up for illustration here), except for the root commit R:

R-o-o-x-x-x   <-- master
     \
      o-o-o   <-- headBranchA, HEAD=branchA, origin/branchA

Every o commit is copied, and branchA is moved to point to the last new one:

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- headBranchA, origin/branchA
 \
  *-*-*-*-*   <-- HEAD=branchA

Later, you go to pick up new stuff from remote origin:

git fetch origin

Let's say this adds commits labeled n (and I'll just add one):

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- headBranchA
|          \
|           n <-- origin/branchA
 \
  *-*-*-*-*   <-- HEAD=branchA

Here's where things go wrong:

git filter-branch ... -- headBranchA..origin/branchA

The "positive ref" here is origin/branchA, so that's what will be moved. The commits selected by the rev-list are just those marked n, which is what you want. Let's spell the rewritten commit N (uppercase) this time:

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- headBranchA
|         |\
|         | n [semi-abandoned - filter-branch writes refs/original/...]
|          \
|           N <-- origin/branchA
 \
  *-*-*-*-*   <-- HEAD=branchA

And now you attempt to git merge origin/branchA, which means to git merge commit N, which requires finding the merge base between the * chain and commit N ... and that's commit R.

This is not, I assume, what you meant to do at all.
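The merge-base claim can be checked with a sketch like the following. It assumes a made-up filter that respells one author name, and a root commit written by a different (also made-up) author so that the filter leaves it untouched. The remote-tracking ref is faked with git update-ref, and FILTER_BRANCH_SQUELCH_WARNING only silences the deprecation notice that newer Git versions print.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_EMAIL='example@example.com'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com'
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/branchA

GIT_AUTHOR_NAME='Root Author' git commit -q --allow-empty -m R
GIT_AUTHOR_NAME='Old Name'    git commit -q --allow-empty -m o1
GIT_AUTHOR_NAME='Old Name'    git commit -q --allow-empty -m o2
git update-ref refs/remotes/origin/branchA HEAD   # stand-in for origin/branchA

# Hypothetical filter: respell one author; commits by anyone else are untouched.
git filter-branch --env-filter '
    if test "$GIT_AUTHOR_NAME" = "Old Name"
    then
        GIT_AUTHOR_NAME="New Name"
        export GIT_AUTHOR_NAME
    fi
' -- branchA >/dev/null

root=$(git rev-list --max-parents=0 branchA)      # the parentless commit, R
base=$(git merge-base branchA origin/branchA)

echo "root:       $root"
echo "merge base: $base"
```

The root commit keeps its ID because nothing in it changed, and it is the only commit the two histories still share, so any merge between them starts from there.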

I suspect what you want to do is, instead, cherry-pick commit N onto the * chain. Let's draw that in:

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- headBranchA
|         |\
|         | n [semi-abandoned - filter-branch writes refs/original/...]
|          \
|           N <-- origin/branchA
 \
  *-*-*-*-*-N' <-- HEAD=branchA

This part is OK, but it's left a mess for the future. It turns out you don't actually want commit N at all, and you don't want to move origin/branchA, because (I assume) you'd like to be able to repeat the git fetch origin step later. So let's "undo" this and try something different. Let's drop the headBranchA label entirely and start with this:

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- origin/branchA
 \
  *-*-*-*-*   <-- HEAD=branchA

Let's add a temporary marker for the commit to which origin/branchA points, and run git fetch origin, so that we get commit n:

R-o-o-x-x-x     <-- master
|    \     .--------temp
|     o-o-o-n   <-- origin/branchA
 \
  *-*-*-*-*     <-- HEAD=branchA

Now let's copy commit n to branchA, and while we're copying it, modify it too (doing whatever mods you would do with git filter-branch) to get a commit we'll just call N:

R-o-o-x-x-x     <-- master
|    \     .--------temp
|     o-o-o-n   <-- origin/branchA
 \
  *-*-*-*-*-N   <-- HEAD=branchA

When this is done we erase temp and we're ready to repeat the cycle.


Making it work

That leaves several problems. The most obvious is: how do we copy n (or several/many ns) and then modify them? Well, the easy way, assuming you have your filter-branch already working, is to use git cherry-pick to copy them, then git filter-branch to filter them.

This only works if the cherry-pick step is not going to run into tree-difference issues, so it depends on what your filter does:

# all of this to be done while on branchA
git tag temp origin/branchA
git fetch origin # pick up `n` commit(s)

git tag temp2    # mark the point for filtering
git cherry-pick temp..origin/branchA
git filter-branch ... -- temp2..branchA   # add -f if refs/original/ still holds a backup from an earlier run

# remove temporary markers
git tag -d temp temp2
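For a concrete run of this recipe, here is a self-contained sketch that plays both sides in a temporary directory. Everything specific in it is an assumption for illustration: the origin and work directory names, the file contents, and the filter itself (respelling the author as "New Name"). The second filter-branch run uses -f because the initial rewrite leaves a backup under refs/original/, and filter-branch refuses to start while one exists.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1     # silence the deprecation notice in newer Git
export GIT_AUTHOR_NAME='Old Name' GIT_AUTHOR_EMAIL='old@example.com'
export GIT_COMMITTER_NAME='Old Name' GIT_COMMITTER_EMAIL='old@example.com'
top=$(mktemp -d)

# Hypothetical filter: respell the author name on every commit.
rename_author='GIT_AUTHOR_NAME="New Name"; export GIT_AUTHOR_NAME'

# "Upstream" repository with two commits on branchA.
git init -q "$top/origin" && cd "$top/origin"
git symbolic-ref HEAD refs/heads/branchA
echo one >file.txt && git add file.txt && git commit -q -m o1
echo two >file.txt && git commit -q -am o2

# Local clone (checks out branchA): initial, full rewrite.
git clone -q "$top/origin" "$top/work" && cd "$top/work"
git filter-branch --env-filter "$rename_author" -- branchA >/dev/null

# Upstream gains a new commit n.
(cd "$top/origin" && echo three >file.txt && git commit -q -am n)

# Incremental step, all done while on branchA.
git tag temp origin/branchA                # where origin/branchA was
git fetch -q origin                        # pick up n
git tag temp2                              # mark the point for filtering
git cherry-pick temp..origin/branchA >/dev/null
git filter-branch -f --env-filter "$rename_author" -- temp2..branchA >/dev/null
git tag -d temp temp2 >/dev/null

authors=$(git log --format=%an branchA | sort -u)
upstream_authors=$(git log --format=%an origin/branchA | sort -u)
count=$(git rev-list --count branchA)
echo "branchA: $count commits, authors: $authors"
echo "origin/branchA authors: $upstream_authors"
```

After the run, branchA holds three filtered commits while origin/branchA is left exactly as fetched, so the same cycle can be repeated on the next fetch.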

What if your filter-branch alters the tree, so that this method won't always work? Well, we can resort to applying the filter directly to the n commits, giving n' commits, then copy the n' commits. Those (n'') commits are the ones that will live on the local (filtered) branchA. The n' commits are not needed once they've been copied, so we discard them.

# lay down temporary marker as before, and fetch
git tag temp origin/branchA
git fetch origin

# now make a new branch, just for filtering
git checkout -b temp2 origin/branchA
git filter-branch ... -- temp..temp2   # add -f if refs/original/ still holds a backup from an earlier run
# the now-altered new branch, temp..temp2, has filtered commits n'

# copy n' commits to n'' commits on branchA
git checkout branchA
git cherry-pick temp..temp2

# and finally, delete the temporary marker and the temporary branch
git tag -d temp
git branch -D temp2 # temp2 requires a force-delete
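Here is the same idea as a runnable sketch with a filter that does alter the tree. The filter (dropping a file called secret.txt), the directory names and the file contents are all hypothetical. Note that the filtered n' commit still has the unfiltered upstream commit as its parent, so the cherry-pick is a three-way merge in which both sides have dropped secret.txt; in this simple case it applies cleanly.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1     # silence the deprecation notice in newer Git
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com'
top=$(mktemp -d)

# Hypothetical tree-altering filter: drop secret.txt from every commit.
drop_secret='git rm -q --cached --ignore-unmatch secret.txt'

# "Upstream" repository with two commits on branchA.
git init -q "$top/origin" && cd "$top/origin"
git symbolic-ref HEAD refs/heads/branchA
echo v1 >a.txt && echo s1 >secret.txt && git add . && git commit -q -m o1
echo v2 >a.txt && git commit -q -am o2

# Local clone: initial, full rewrite of branchA.
git clone -q "$top/origin" "$top/work" && cd "$top/work"
git filter-branch --index-filter "$drop_secret" -- branchA >/dev/null

# Upstream gains a new commit n that touches both files.
(cd "$top/origin" && echo v3 >a.txt && echo s2 >secret.txt && git commit -q -am n)

# Temporary marker, then fetch.
git tag temp origin/branchA
git fetch -q origin

# Filter the new commits on a branch made just for that (gives n').
git checkout -q -b temp2 origin/branchA
git filter-branch -f --index-filter "$drop_secret" -- temp..temp2 >/dev/null

# Copy n' onto the filtered branchA (gives n'').
git checkout -q branchA
git cherry-pick temp..temp2 >/dev/null

# Clean up the marker and the scratch branch.
git tag -d temp >/dev/null
git branch -q -D temp2

files=$(git ls-tree -r --name-only branchA)
content=$(git show branchA:a.txt)
count=$(git rev-list --count branchA)
echo "files on branchA: $files (a.txt = $content, $count commits)"
```

With filters that change more of the tree, the cherry-pick can still conflict, so treat this as a starting point rather than a guarantee.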

Other problems

We've covered (in the graph drawings) how new commits get copied-and-modified into your "incrementally filtered" branchA. But what happens if, when you go consult origin, you find that commits were removed?

That is, we start with this:

R-o-o-x-x-x   <-- master
|    \
|     o-o-o   <-- origin/branchA
 \
  *-*-*-*-*   <-- HEAD=branchA

We lay down our temporary marker as usual and do git fetch origin. But what they did was remove the last o commit, with a force-push on their end. Now we have:

R-o-o-x-x-x   <-- master
|    \
|     o-o     <-- origin/branchA
|        `o.......temp
 \
  *-*-*-*-*   <-- HEAD=branchA

The implication here is that we should probably move branchA back by one revision as well.

Whether you want to handle this at all is up to you. I'll note here that the result of git rev-list temp..origin/branchA will be empty in this particular case (there are no commits on the revised origin/branchA that are not reachable from temp), but origin/branchA..temp will not be empty: it will list the one "removed" commit. If two commits were removed, it would list the two commits, and so on.

It's possible for whoever controls origin to have removed several commits and added some other new commits (in fact, this is exactly what happens with an "upstream rebase"). In this case, both git rev-list commands will be non-empty: origin/branchA..temp will show you what was removed, and temp..origin/branchA will show you what was added.
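A quick way to classify what happened upstream is to count both directions of the range. The sketch below fakes the fetch with git update-ref in a scratch repository (no real remote), first for a plain removal and then for a remove-and-replace; the commit labels are made up.

```shell
set -e
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com'
repo=$(mktemp -d) && cd "$repo" && git init -q

for label in o1 o2 o3; do git commit -q --allow-empty -m "$label"; done
git tag temp                               # where origin/branchA pointed before the fetch

# Case 1: upstream force-pushed away the last commit (simulated).
git update-ref refs/remotes/origin/branchA "$(git rev-parse temp~1)"
removed1=$(git rev-list --count origin/branchA..temp)
added1=$(git rev-list --count temp..origin/branchA)

# Case 2: upstream replaced o3 with a different commit, as in an "upstream rebase".
git checkout -q --detach temp~1
git commit -q --allow-empty -m o3-rewritten
git update-ref refs/remotes/origin/branchA HEAD
removed2=$(git rev-list --count origin/branchA..temp)
added2=$(git rev-list --count temp..origin/branchA)

echo "case 1: removed=$removed1 added=$added1"
echo "case 2: removed=$removed2 added=$added2"
```

A non-zero "removed" count is the signal that the filtered branch needs to be backed up before any new commits are copied over.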

Last, it's possible for whoever controls origin to completely wreck everything for you. They can:

  • remove their branchA entirely, or
  • make their label branchA point to an unrelated branch.

Again, it's up to you whether, and if so how, to handle these cases.

torek answered Dec 25 '22 04:12


Git 2.18 (Q2 2018) improves support for incremental filtering.

"git filter-branch" learned to use a different exit code to allow the callers to tell the case where there was no new commits to rewrite from other error cases.

See commit 0a0eb2e (15 Mar 2018) by Michele Locati (mlocati).
(Merged by Junio C Hamano -- gitster -- in commit cb3e97d, 09 Apr 2018)

filter-branch: return 2 when nothing to rewrite

Using the --state-branch option allows us to perform incremental filtering. This may lead to having nothing to rewrite in subsequent filtering, so we need a way to recognize this case.
So, let's exit with 2 instead of 1 when this "error" occurs.
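A caller can use that exit code roughly as sketched below. This assumes Git 2.18 or later and does not use --state-branch at all: it only provokes the "nothing to rewrite" case with an empty range, using a marker tag placed at the branch tip. The filter and the message strings are made up for illustration.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1     # silence the deprecation notice in newer Git
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.com'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.com'
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/branchA

git commit -q --allow-empty -m first
git tag temp                               # marker at the tip, so temp..branchA is empty

# Hypothetical filter, as in the other sketches.
rename_author='GIT_AUTHOR_NAME="New Name"; export GIT_AUTHOR_NAME'

status=0
git filter-branch --env-filter "$rename_author" -- temp..branchA >/dev/null 2>&1 || status=$?

case $status in
    0) result='rewrote new commits' ;;
    2) result='nothing new to rewrite' ;;
    *) result="filter-branch failed (exit $status)" ;;
esac
echo "$result"
```

In a scheduled job this lets the "nothing new" case pass quietly while any other non-zero status is still treated as a failure.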

VonC answered Dec 25 '22 02:12