
git filter-branch - discard the changes to a set of files in a range of commits

Say I have a branch dev and I want to discard all the changes made to a set of files in the range of commits on dev since it diverged from master. If a commit in this range touches only those files, I'd like it pruned. The closest I got was:

git checkout dev
git filter-branch --force --tree-filter 'git checkout master -- \
a/b/c.png \
...
' --prune-empty -- master-dev-older-ancestor..HEAD

but this has these drawbacks

  1. if the file was since deleted in master it will fail with error: pathspec 'a/b/c.png' did not match any file(s) known to git. I might decide to git checkout master-dev-older-ancestor but then,
  2. this file may not exist in master-dev-older-ancestor, and was merged in from master back to dev at a later point
  3. after all I may want to discard changes to some files that are nowhere to be seen in master

Fundamentally the point is that I do not want to tell git to check out a specific version of the file. I want to tell git to filter all commits in the range master-dev-older-ancestor..HEAD so that all changes to an arbitrary set of files (present anywhere on master or not) are discarded.

So how do I tell git ?

Mr_and_Mrs_D asked Mar 08 '14 15:03


1 Answer

Fundamentally, what filter-branch does is this (everything else is optimization and/or edge cases): [1]

  • For each commit in the listed revision(s):
    1. check out that commit;
    2. apply the filter(s);
    3. create a new commit, which may or may not be the same as the old commit depending on step 2 (i.e., this new copy is a modified version of the old one, unless it's bit-for-bit identical, in which case the "created new" commit is actually just the old commit after all).
  • For each "positive" ref on the command line, rewrite it to point to the new commit made in step 3 wherever it pointed to an old commit checked out in step 1.

Now let's consider your desired action, but I'm going to emphasize a different word:

filter all commits in [a] range ... to have all changes in an arbitrary set of files ... discarded

I emphasize "changes" here because each commit is a complete, stand-alone entity. Commits don't have "changes", they just have files. The only way to see changes is to compare one specific commit against another specific commit: git diff commitA commitB for example.
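To make the snapshot point concrete, here is a minimal sketch in a throwaway repository. The repository location, the file names a.txt and b.txt, and the variable names are all made up for illustration. It shows that the second commit still records both files, while "what changed" only appears once two commits are compared.

```shell
#!/bin/sh
# Illustration only: a scratch repository with hypothetical file names.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config commit.gpgsign false

echo one > a.txt
echo keep > b.txt
git add a.txt b.txt
git commit -q -m "first"
first=$(git rev-parse HEAD)

echo two > a.txt
git commit -q -am "second"
second=$(git rev-parse HEAD)

# The second commit records every file, not only the one that differs.
files_in_second=$(git ls-tree -r --name-only "$second")

# A "change" exists only as the result of comparing two commits.
changed=$(git diff --name-only "$first" "$second")

echo "files recorded in second commit:"
echo "$files_in_second"
echo "files that differ between first and second:"
echo "$changed"
```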

Thus, when you say "changes to some file(s)", the immediate obvious question should be: changes with respect to what?

In most cases, people who talk about "changes in a commit" mean "changes in this commit with respect to its immediate ancestor": for simple (non-merge) commits, the patch you'd get with git show or git log -p. (Usually they have not thought about what they mean if the commit is a merge, and therefore has multiple parents. For these, git show generally shows a combined diff of the merge commit against all its parents, but that may not match the user's intent here; see the git-show documentation for details.)

When using git filter-branch, you will have to define this (changes with respect to what) yourself. The filter-branch command gives you the SHA-1 ID of the checked-out commit—even if it's only "virtually" checked out in step 1, rather than actually stuffed into an on-disk tree—in the environment variable $GIT_COMMIT. So, if your definition of "with respect to what" is "with respect to first parent", you can use gitrevisions syntax to refer to the parent: ${GIT_COMMIT}^ is the first-parent, even when ${GIT_COMMIT} is a raw SHA-1.
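As a small sketch of the ^ suffix applied to a raw ID: the snippet below sets GIT_COMMIT by hand as a stand-in for the value filter-branch would export, in a scratch repository with a hypothetical file f.txt. It also shows that a root commit has no first parent, which a filter using ${GIT_COMMIT}^ has to tolerate.

```shell
#!/bin/sh
# Illustration only: GIT_COMMIT is set by hand here; inside a real
# filter, git filter-branch exports it for you.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config commit.gpgsign false

echo one > f.txt
git add f.txt
git commit -q -m "first"
first=$(git rev-parse HEAD)

echo two > f.txt
git commit -q -am "second"
GIT_COMMIT=$(git rev-parse HEAD)

# The ^ suffix works on a raw SHA-1 just as it does on a branch name.
parent=$(git rev-parse "${GIT_COMMIT}^")

# The parent's version of a file, without checking anything out.
parent_version=$(git show "${GIT_COMMIT}^:f.txt")

# A root commit has no first parent, so the same expression fails there.
if git rev-parse -q --verify "${first}^" >/dev/null; then
    root_has_parent=yes
else
    root_has_parent=no
fi

echo "parent=$parent parent_version=$parent_version root_has_parent=$root_has_parent"
```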

A very crude and un-optimized --tree-filter that simply extracts the parent versions of each such file goes like this: [2]

for path in ...list-of-paths...; do
    # quote both the revision and the path so white space cannot split them
    git checkout -q "${GIT_COMMIT}^" -- "$path" 2>/dev/null
done
exit 0 # in case the last "git checkout" failed, override its status

which simply asks git to retrieve the parent commit's version of the file, discarding any error message that occurs because the file does not exist in the parent version. But this may not match your intent either: it's not clear whether you want to remove the file if it is not in the parent. Moreover, if a file is added or removed somewhere in the sequence of commits in your range, comparing each original commit only to its (single) original parent commit may mis-fire. For instance, if file foo does not exist in commit C5, does exist in C6, and remains unchanged in C7, the comparison between C7 and C6 says "file unchanged" while the earlier comparison of C5-to-C6 says "file added". If your new (altered) C6—let's call it C6' to tell them apart—removes foo because it was not in C5, presumably your C7' should also omit file foo.

Another alternative is to compare each commit to the (single) commit just before the entire range. If your range covers commits C1, C2, C3, ..., C9, we can call the single previous commit C0. Then, instead of comparing C1 to C1^, C2 to C2^, and so on, we can compare C1 to C0, C2 to C0, C3 to C0, and so on. Depending on your definition of "changes", this may be exactly what you want, because "undoing a change" may be transitive: we remove foo in our new C6, therefore we must remove foo in our new C7 as well; we add back bar in the new C7, therefore we must add it back in the new C8 as well, and so on.
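The difference between the two baselines can be seen in a scratch repository. This is only a sketch: the commit messages C0, C1, C2 and the files foo and other.txt are hypothetical stand-ins for the C5/C6/C7 example above. Compared with its own parent, the last commit shows nothing for foo; compared with the commit before the range, it shows foo as added.

```shell
#!/bin/sh
# Illustration only: three commits, where foo is added in the middle one.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config commit.gpgsign false

echo base > other.txt
git add other.txt
git commit -q -m "C0"
c0=$(git rev-parse HEAD)

echo data > foo
git add foo
git commit -q -m "C1 adds foo"

echo more >> other.txt
git commit -q -am "C2 leaves foo alone"
c2=$(git rev-parse HEAD)

# Baseline 1: the commit's own parent. foo is unchanged, so no output.
vs_parent=$(git diff --name-status --no-renames "${c2}^" "$c2" -- foo)

# Baseline 2: the commit just before the range. foo shows up as added.
vs_base=$(git diff --name-status --no-renames "$c0" "$c2" -- foo)

echo "against parent: [$vs_parent]"
echo "against C0:     [$vs_base]"
```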

A less-crude version of the comparison script goes like this (this can be optimized for --index-filter as well, although I will leave the work up to someone else since this is meant for illustration):

# Note: I haven't tested this either, not sure how it behaves if
# used inside git filter-branch.  As a --tree-filter you would not
# really want to "git rm" anything, just to "rm" it.  As an
# --index-filter you would want to "git rm --cached".  For
# checkout, as a tree filter you want to extract the file into
# the working tree, and as an index filter you want to extract
# the file into the index.
git diff --name-status --no-renames "$WITH_RESPECT_TO" "$GIT_COMMIT" \
    -- ...paths... |
while read -r status path; do
    # note: $path may have embedded white space, so we
    # quote it below to protect it from breaking into words
    case $status in
    A) git rm -- "$path";; # file was added, rm it to undo
    D|M) git checkout "$WITH_RESPECT_TO" -- "$path";; # deleted or modified
    *) echo "file $path has strange status $status, help!" 1>&2; exit 1;;
    esac
done

Explanation: the above assumes you're filtering a (maybe linear, maybe branch-y) series of commits C1, C2, ..., Cn. You want them to "not alter the contents or even existence" of some set of paths, with respect to some parent-of-C1 commit. You must set an appropriate specifier into $WITH_RESPECT_TO. (This can come from the environment, or just be hard-coded into an actual script. Note that for your --index-filter or --tree-filter, you can have the shell run a script, rather than trying to do it all in line.)

For instance, if you're filtering X..Y, which means "all commits reachable from label Y excluding all commits reachable from label X", it's possible that the appropriate value for $WITH_RESPECT_TO is simply X, but it is more likely the merge-base of X and Y. If X and Y are branches that look something like this:

...-o-o-o-o-o-o   <-- master
     \
      *-o-o       <-- X
       \
        o-o-o-o   <-- Y

then you're filtering the commits on the bottom row, and the first commit to be filtered should probably be "unchanged with respect to some paths as seen in commit *" (the commit I marked with an asterisk). That's the commit that git merge-base X Y would come up with.
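A rough sketch of that lookup, rebuilding a cut-down version of the graph in a scratch repository. The branch names X and Y follow the diagram; the file names and commit messages are made up. The value it computes is what you would put into $WITH_RESPECT_TO.

```shell
#!/bin/sh
# Illustration only: a reduced version of the graph drawn above.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config commit.gpgsign false

echo 0 > f
git add f
git commit -q -m "on the master line"

git checkout -q -b X
echo star > s
git add s
git commit -q -m "the commit marked *"
star=$(git rev-parse HEAD)
echo x > x
git add x
git commit -q -m "X work"

git checkout -q -b Y "$star"
echo y > y
git add y
git commit -q -m "Y work"

# The commit where X and Y part ways.
WITH_RESPECT_TO=$(git merge-base X Y)

# The commits that X..Y would hand to filter-branch.
to_filter=$(git rev-list --count X..Y)

echo "WITH_RESPECT_TO=$WITH_RESPECT_TO (commits to filter: $to_filter)"
```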

If you're working with raw SHA-1 IDs, you might be able to use something like:

WITH_RESPECT_TO=676699a0e0cdfd97521f3524c763222f1c30a094 \
git filter-branch ... (filter-branch arguments go here) ... -- \
    676699a0e0cdfd97521f3524c763222f1c30a094..branch

where the raw SHA-1 is the ID of commit *, as it were.
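For what it's worth, here is one way the whole thing might look as an --index-filter, run end to end in a scratch repository. Treat it as a sketch under assumptions rather than a tested recipe for a real history: the files keep.txt, frozen.txt and added.txt are hypothetical, the history is linear, and instead of the diff-and-case loop it takes a blunter route. It drops the listed paths from the index and re-adds whatever $WITH_RESPECT_TO has for them, relying on git update-index --index-info accepting git ls-tree output.

```shell
#!/bin/sh
# Illustration only: scratch repository, hypothetical file names.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1  # skip the warning pause in newer Git
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config commit.gpgsign false

echo base > keep.txt
echo original > frozen.txt
git add keep.txt frozen.txt
git commit -q -m "C0"
WITH_RESPECT_TO=$(git rev-parse HEAD)
export WITH_RESPECT_TO                  # the filter runs in a child shell

echo edit1 >> keep.txt
echo tampered > frozen.txt
echo new > added.txt
git add -A
git commit -q -m "C1: edits keep.txt, changes frozen.txt, adds added.txt"

echo only > added.txt
git commit -q -am "C2: touches only added.txt"

# For the listed paths: forget this commit's entries, then restore
# whatever the baseline commit had (nothing, if the path is absent there).
git filter-branch -f --prune-empty --index-filter '
    git rm -r -q --cached --ignore-unmatch -- frozen.txt added.txt
    git ls-tree -r "$WITH_RESPECT_TO" frozen.txt added.txt |
        git update-index --index-info
' -- "$WITH_RESPECT_TO"..HEAD >/dev/null 2>&1

frozen_now=$(git show HEAD:frozen.txt)
files_now=$(git ls-tree -r --name-only HEAD)
commits_left=$(git rev-list --count "$WITH_RESPECT_TO"..HEAD)
echo "frozen.txt=$frozen_now commits_left=$commits_left"
echo "$files_now"
```

In this toy history the rewritten C2 ends up with the same tree as the rewritten C1, so --prune-empty drops it, which is the pruning behavior the question asks for. Merge commits and paths that need quoting are not handled here.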

As for the git diff itself, let's look at the sort of output it produces:

$ git diff --name-status --no-renames \
>  2cd861672e1021012f40597b9b68cc3a9af62e10 \
>  7bbc4e8fdb33e0a8e42e77cc05460d4c4f615f4d
M       Documentation/RelNotes/1.8.5.4.txt
A       Documentation/RelNotes/1.8.5.5.txt
M       Documentation/git.txt
M       GIT-VERSION-GEN
M       RelNotes

(this is actual output of git diff on the source tree for git itself). Between those two revisions, one release-notes text file was modified, one was added, Documentation/git.txt was modified, and so on. Now let's try that again but restricting it to one real pathname and one fake one:

$ git diff --name-status --no-renames \
>  2cd861672e1021012f40597b9b68cc3a9af62e10 \
>  7bbc4e8fdb33e0a8e42e77cc05460d4c4f615f4d \
>  -- Documentation/RelNotes/1.8.5.5.txt NoSuchFile
A       Documentation/RelNotes/1.8.5.5.txt

Now we find out about the one added file, but there is no complaint about the nonexistent file. So it's OK to give "nonexistent" paths; they simply won't occur in the output.

If diffing commit $WITH_RESPECT_TO against some later commit C says that path p is added in commit C, we know that it does not exist in $WITH_RESPECT_TO and does in C, so we want to remove it so that it's "unchanged". (This is the case for status-letter A.)

If the diff says that path p is deleted in C, we know that it does exist in $WITH_RESPECT_TO, and must be restored to remain "unchanged". (This is the case for status-letter D.)

If the diff says that path p exists in both, but the contents of the file differ in C, the contents must be restored to remain "unchanged". (This is the case for status-letter M.)

Other diff status letters are C, R, T, U, X, and B, but some cannot occur: we exclude C, R, and B by specifying appropriate git diff options; U only occurs during incomplete merges; and X should never occur (see the question "What do the Git “pairing broken” and “unknown” statuses mean, and when do they occur?"). The T case is possibly cause to abort the filtering (a regular file changed to a symlink, or vice versa, for instance; or something replaced with a submodule).


If, after thinking about the issue for a while, you decide that "with respect to" should use parent commit(s), you can use git diff-tree, which—given a single commit—compares the tree of the commit with those of its parents. (But again, note its behavior on merge commits, and make sure that's what you want.)


[1] When using --tree-filter, it actually does the full-blown check-everything-out part. With --index-filter it writes the commit into the index, but not actually into the file system, and lets you make all the changes within the index. With --env-filter, --msg-filter, --parent-filter, and --commit-filter, it lets you change the text, author, and/or parents of each commit. The --tag-name-filter lets you alter the tag names if needed, and causes the new names to point to the new commits instead of the old ones (hence --tag-name-filter cat leaves the names unchanged and makes those that pointed to the old commits now point to the new ones).

The --prune-empty option covers an edge case: if you have a chain of commits C1 <- C2 <- C3, and your C2' (your copy of C2) has the same underlying tree as your C1', comparing the trees of C2' and C1' produces an empty diff. The filter-branch operation normally keeps such commits, but omits them if you use --prune-empty: your new chain will then be C1' <- C3'. But note that the original chain may already have "empty" commits; in this case, filter-branch will prune those too, even if the copies are actually the same as the originals.

[2] These scripts are written as if in script files. If you turn them into one-liners you will need to add semicolons as necessary, and perhaps also turn exit into return, since you don't want the whole thing to exit when it is eval'ed.

torek answered Oct 13 '22 00:10