How to filter history based on gitignore?

Question

To be clear on this question, I am not asking about how to remove a single file from history, like this question: Completely remove file from all Git repository commit history. I am also not asking about untracking files from gitignore, like in this question: Ignore files that have already been committed to a Git repository.

I am talking about "updating a .gitignore file, and subsequently removing everything matching the list from history", more or less like this question: Ignore files that have already been committed to a Git repository. However, unfortunately, the answer from that question does not work for this purpose, so I am here to try elaborating the question and hopefully find a good answer that does not involve a human looking through an entire source tree to manually do a filter-branch on each matched file.

Here I provide a test script, currently performing the procedure in the answer of Ignore files that have already been committed to a Git repository. It is going to remove and create a folder root under PWD, so be careful before running it. I will describe my goal after the code.

#!/bin/bash -e

TESTROOT=${PWD}
GREEN="\e[32m"
RESET="\e[39m"

rm -rf root
mkdir -v root
pushd root

mkdir -v repo
pushd repo
git init

touch a b c x 
mkdir -v main
touch main/{a,x,y,z}

# Initial commit
git add .
git commit -m "Initial Commit"
echo -e "${GREEN}Contents of first commit${RESET}"
git ls-files | tee ../00-Initial.txt

# Add another commit just for demo
touch d e f y z main/{b,c}
## Make some other changes
echo "Test" | tee a | tee b | tee c | tee x | tee main/a > main/x
git add .
git commit -m "Some edits"

echo -e "${GREEN}Contents of second commit${RESET}"
git ls-files | tee ../01-Changed.txt

# Now I want to ignore all 'a' and 'b', and all 'main/x', but not 'main/b'
## Checkout the root commit
git checkout -b temp $(git rev-list HEAD | tail -1)
## Add .gitignores
echo "a" >> .gitignore
echo "b" >> .gitignore
echo "x" >> main/.gitignore
echo "!b" >> main/.gitignore
git add .
git commit --amend -m "Initial Commit (2)"
## --v Not sure if it is correct
git rebase --onto temp master
git checkout master
## --v Now, why should I delete this branch?
git branch -D temp
echo -e "${GREEN}Contents after rebase${RESET}"
git ls-files | tee ../02-Rebased.txt

# Supposedly, rewrite history
git filter-branch --tree-filter 'git clean -f -X' -- --all
echo -e "${GREEN}Contents after filter-branch${RESET}"
git ls-files | tee ../03-Rewritten.txt

echo "History of 'a'"
git log -p a

popd # repo

popd # root

This code creates a repository, adds some files, makes some edits, and performs the cleaning procedure. It also generates some log files. Ideally, I would like a, b, and main/x to disappear from history, while main/b stays. However, right now nothing is removed from history. What should be modified to achieve this goal?

Bonus points if this can be done on multiple branches. But for now, keep it to a single master branch.

torek · Accepted Answer

Achieving the result you want is a bit tricky. The simplest way, using git filter-branch with a --tree-filter, will be very slow. Edit: I've modified your example script to do this; see the end of this answer.

First, let's note one constraint: you can never change any existing commit. All you can do is make new commits that look a lot like the old ones, but "new and improved". You then direct Git to stop looking at the old commits, and look only at the new ones. This is what we will do here. (Then, if required, you can force Git to really forget the old commits. The easiest way is to re-clone the clone.)

Now, to re-commit every commit that is reachable from one or more branch and/or tag names, preserving everything except that which we explicitly tell it to change,¹ we can use git filter-branch. The filter-branch command has a rather dizzying array of filtering options, most of which are meant to make it go faster, because copying every commit is pretty slow. If there are just a few hundred commits in a repository, with a few dozens or hundreds of files each, it's not so bad; but if there are about 100k commits holding about 100k files each, that's ten thousand million files (10,000,000,000 files) to examine and re-commit. It is going to take a while.

Unfortunately there is no easy and convenient way to speed this up. The best way to speed it up would be to use an --index-filter, but there is no built in index filter command that will do what you want. The easiest filter to use is --tree-filter, which is also the slowest one there is. You might want to experiment with writing your own index filter, perhaps in shell script or perhaps in another language you prefer (you will still need to invoke git update-index either way).

¹Signed annotated tags cannot be preserved intact, so their signatures will be stripped. Signed commits may have their signatures become invalid (if the commit hash changes, which depends on whether it must: remember that the hash ID of a commit is the checksum of the commit's contents, so if the set of files changes, the checksum changes; but if the checksum of a parent commit changes, the checksum of this commit also changes).

Using `--tree-filter`

When you use git filter-branch with --tree-filter, what the filter-branch code does is to extract each commit, one at a time, into a temporary directory. This temporary directory has no .git directory and is not where you are running git filter-branch (it's actually in a subdirectory of the .git directory unless you use the -d option to redirect Git to, say, a memory filesystem, which is a good idea for speeding it up).

After extracting the entire commit into this temporary directory, Git runs your tree-filter. Once your tree-filter finishes, Git packages up everything in that temporary directory into the new commit. Whatever you leave there, is in. Whatever you add to there, is added. Whatever you modify there, is modified. Whatever you remove from there, is no longer in the new commit.

Note that a .gitignore file in this temporary directory has no effect on what will be committed (but the .gitignore file itself will be committed, since whatever is in the temporary directory becomes the new copy-commit). So if you want to be sure that a file of some known path is not committed, simply rm -f known/path/to/file.ext. If the file was in the temporary directory, it is now gone. If not, nothing happens and all is well.

Hence, a workable tree filter would be:

rm -f $(cat /tmp/files-to-remove)

(assuming no white space issues in file names; use xargs ... rm -f to avoid white space issues, with whatever encoding you like for the xargs input; -z style (NUL-terminated) encoding is ideal since \0 is forbidden in path names).
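As a minimal sketch of the white-space-safe variant, assuming the remove list is NUL-terminated and that your xargs supports -0 (GNU and BSD versions do), the command can be tried outside of Git first. The scratch directory and file names below are made up for the demonstration:

```shell
#!/bin/sh
# Try the removal command in a scratch directory before using it as a filter.
# The directory and the file names are hypothetical.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir main
touch a b "main/x y" main/b

# Build the list with NUL terminators so spaces in names are harmless.
list="$work/files-to-remove"
printf '%s\0' a b "main/x y" no-such-file > "$list"

# This line is what would serve as the body of the tree filter.
# rm -f stays silent about paths that do not exist.
xargs -0 rm -f < "$list"

ls -R
```

With the list stored at /tmp/files-to-remove, the corresponding invocation would be along the lines of git filter-branch --tree-filter 'xargs -0 rm -f < /tmp/files-to-remove' -- --all.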

Converting this to an index filter

Using an index filter lets Git skip the extract-and-examine phases. If you had a fixed "remove" list in the right form, it would be easy to use.

Let's say you have the file names in /tmp/files-to-remove in a form that is suitable for xargs -0. Your index filter might then read, in its entirety:

xargs -0 git rm --cached -f --ignore-unmatch < /tmp/files-to-remove

which is basically the same as the rm -f above, but works within the temporary index Git uses for each commit-to-be-copied. (Add -q to the git rm --cached to make it quiet.)
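A sketch of such an index filter run end to end in a throwaway repository may make this concrete. The repository layout, the file names, and the temporary list file are assumptions for illustration; FILTER_BRANCH_SQUELCH_WARNING only silences the warning that recent Git versions print:

```shell
#!/bin/sh
# Run an index filter against a throwaway repository.
# All paths and names here are hypothetical.
repo=$(mktemp -d)
list=$(mktemp)
cd "$repo" || exit 1
git init -q
git config user.email "you@example.com"
git config user.name "Example"

mkdir main
echo one > a; echo one > main/b; echo one > main/x
git add . && git commit -q -m "first"
echo two > a
git commit -q -am "second"

# NUL-terminated list of paths to drop from every commit.
printf '%s\0' a main/x > "$list"

# The filter works on the temporary index only; no files are checked out.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --index-filter \
  "xargs -0 git rm --cached -q -f --ignore-unmatch -- < '$list'" -- --all

# Paths that still appear in the rewritten history of HEAD.
# (--all would also walk the refs/original backups, so HEAD is used here.)
git log --name-only --pretty=format: HEAD | grep . | sort -u
```

Because --prune-empty is not given, a commit that only touched removed files is kept as an empty commit.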

Applying `.gitignore` files in a tree filter

Your example script tries to use a --tree-filter after rebasing onto an initial commit that has the desired items:

git filter-branch --tree-filter 'git clean -f -X' -- --all

There is one initial bug though (the git rebase is wrong):

-git rebase --onto temp master
+git rebase --onto temp temp master

Fixing that, the thing still doesn't work, and the reason is that git clean -f -X only removes files that are actually ignored. Any file that is already in the index, is not actually ignored.
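This behavior is easy to reproduce in isolation. The following sketch, using a made-up repository with a single file, runs git clean as a dry run (-n) before and after the file is dropped from the index:

```shell
#!/bin/sh
# An ignore rule has no effect on a file that is still in the index.
# The repository and file name are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "you@example.com"
git config user.name "Example"

echo data > a
git add a && git commit -q -m "track a"
echo a > .gitignore

# 'a' is tracked, so the dry run finds nothing to remove.
before=$(git clean -n -X)

# Drop 'a' from the index only; the work-tree copy stays on disk.
git rm --cached -q a

# Now 'a' is untracked and matches the ignore rule, so it is listed.
after=$(git clean -n -X)

printf 'before: [%s]\nafter:  [%s]\n' "$before" "$after"
```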

The trick is to empty out the index. However, this does too much: git clean then never descends into subdirectories—so the trick comes in two parts: empty out the index, then re-fill it with non-ignored files. Now git clean -f -X will remove the remaining files:

-git filter-branch --tree-filter 'git clean -f -X' -- --all
+git filter-branch --tree-filter 'git rm --cached -qrf . && git add . && git clean -fqX' -- --all

(I added several "quiet" flags here).

To avoid needing to rebase in the first place to install initial .gitignore files, let's say you have a master set of .gitignore files you want in every commit (which we'll then use in the tree filter as well). Simply place these, and nothing else, in a temporary tree:

mkdir /tmp/ignores-to-add
cp .gitignore /tmp/ignores-to-add
mkdir /tmp/ignores-to-add/main
cp main/.gitignore /tmp/ignores-to-add/main

(I'll leave working up a script that finds and copies just the .gitignore files to you; it seems moderately annoying to do without one.) Then, for the --tree-filter, use:

cp -R /tmp/ignores-to-add/. . &&
    git rm --cached -qrf . &&
    git add . &&
    git clean -fqX

The first step, cp -R (which can be done anywhere before the git add ., really), installs the correct .gitignore files. Since we do this to each commit, we never need to rebase before running filter-branch.

The second removes everything from the index. (A slightly faster method is just rm $GIT_INDEX_FILE but it's not guaranteed that this will work forever.)

The third re-adds ., i.e., everything in the temporary tree. Since the .gitignore files are in place, we only add non-ignored files.

The last step, git clean -qfX, removes work-tree files that are ignored, so that filter-branch won't put them back.
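The directory of .gitignore files used in the first step could be assembled with something like the sketch below. The function name collect_ignores is made up, and the approach assumes no newlines in directory names and a destination outside the tree being scanned:

```shell
#!/bin/sh
# Copy every .gitignore under the current directory into a staging
# directory, keeping the relative layout. collect_ignores is a
# hypothetical helper name.
collect_ignores() {
  dest=$1
  # Skip the .git directory itself; print only regular .gitignore files.
  find . -name .git -prune -o -type f -name .gitignore -print |
  while IFS= read -r f; do
    mkdir -p "$dest/$(dirname "$f")"
    cp "$f" "$dest/$f"
  done
}

# Demonstration with a made-up tree.
src=$(mktemp -d)
out=$(mktemp -d)
mkdir -p "$src/main"
printf 'a\nb\n' > "$src/.gitignore"
printf 'x\n!b\n' > "$src/main/.gitignore"
touch "$src/a" "$src/main/x"

(cd "$src" && collect_ignores "$out")
find "$out" -type f | sort
```

Run from the top of a checkout that holds the desired ignore files, collect_ignores /tmp/ignores-to-add would populate the staging directory in one pass.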

lolikandr · Answer

On Windows, this sequence did not work for me:

cp -R /tmp/ignores-to-add . &&
git rm --cached -qrf . &&
git add . &&
git clean -fqX

But the following works.

Update every commit using the existing .gitignore:

git filter-branch --index-filter '
  # -c (cached) is required alongside -i on recent Git versions
  git ls-files -c -i --exclude-from=.gitignore | xargs git rm --cached -q
' -- --all

Update .gitignore in every commit and filter files:

cp ../.gitignore /d/tmp-gitignore
git filter-branch --index-filter '
  cp /d/tmp-gitignore ./.gitignore
  git add .gitignore
  # -c (cached) is required alongside -i on recent Git versions
  git ls-files -c -i --exclude-from=.gitignore | xargs git rm --cached -q
' -- --all
rm /d/tmp-gitignore

Use grep -v if you have special cases, for example a file named empty that exists only to keep an otherwise empty directory:

git ls-files -c -i --exclude-from=.gitignore | grep -vE "empty$" | xargs git rm --cached -q
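One way to check the result of a rewrite like this is to list every path recorded in history and ask Git which of them the current ignore rules match. This is only a sketch on a made-up repository; it relies on git check-ignore with --no-index and --stdin, and it assumes path names that git log prints unquoted:

```shell
#!/bin/sh
# List paths recorded in HEAD's history that the current ignore rules match.
# The repository contents are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "you@example.com"
git config user.name "Example"

echo 1 > keep.txt; echo 1 > build.log
git add . && git commit -q -m "first"
git rm -q build.log && git commit -q -m "drop log"
echo '*.log' > .gitignore
git add .gitignore && git commit -q -m "ignore logs"

# build.log is gone from the tip but still present in older commits.
# --no-index makes check-ignore test the names regardless of tracking.
leftovers=$(git log --name-only --pretty=format: HEAD | grep . | sort -u |
  git check-ignore --no-index --stdin)

echo "ignored paths still in history: $leftovers"
```

An empty result after filtering suggests that no commit reachable from HEAD still contains a path matching the ignore rules.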

goofology · Answer

This method makes git completely forget ignored files (past/present/future), but does not delete anything from working directory (even when re-pulled from remote).

This method requires usage of /.git/info/exclude (preferred) OR a pre-existing .gitignore in all the commits that have files to be ignored/forgotten. ¹
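As a small illustration of the .git/info/exclude route, the sketch below adds a pattern there and lists the tracked files it matches. The repository and file names are made up, and -c is passed explicitly because recent Git versions require -c or -o together with --ignored:

```shell
#!/bin/sh
# Patterns in .git/info/exclude are local to one clone and never committed.
# The repository and file names are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "you@example.com"
git config user.name "Example"

echo 1 > keep.txt; echo 1 > secret.env
git add . && git commit -q -m "first"

# The info directory normally exists already; mkdir -p is a safeguard.
mkdir -p .git/info
echo '*.env' >> .git/info/exclude

# Tracked files that the standard ignore sources now match.
tracked_ignored=$(git ls-files -c --ignored --exclude-standard)
echo "$tracked_ignored"
```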

All methods of enforcing git ignore behavior after-the-fact effectively re-write history and thus have significant ramifications for any public/shared/collaborative repos that might be pulled after this process. ²

General advice: start with a clean repo - everything committed, nothing pending in working directory or index, and make a backup!

Also, the comments/revision history of this answer (and revision history of this question) may be useful/enlightening.

#commit up-to-date .gitignore (if not already existing)
#this command must be run on each branch

git add .gitignore
git commit -m "Create .gitignore"

#apply standard git ignore behavior only to current index, not working directory (--cached)
#if this command returns nothing, ensure /.git/info/exclude AND/OR .gitignore exist
#this command must be run on each branch

git ls-files -z -c --ignored --exclude-standard | xargs -0 git rm --cached

#Commit to prevent working directory data loss!
#this commit will be automatically deleted by the --prune-empty flag in the following command
#this command must be run on each branch

git commit -m "ignored index"

#Apply standard git ignore behavior RETROACTIVELY to all commits from all branches (--all)
#This step WILL delete ignored files from working directory UNLESS they have been dereferenced from the index by the commit above
#This step will also delete any "empty" commits.  If deliberate "empty" commits should be kept, remove --prune-empty and instead run git reset HEAD^ immediately after this command

git filter-branch --tree-filter 'git ls-files -z -c --ignored --exclude-standard | xargs -0 git rm -f --ignore-unmatch' --prune-empty --tag-name-filter cat -- --all

#List all still-existing files that are now ignored properly
#if this command returns nothing, it's time to restore from backup and start over
#this command must be run on each branch

git ls-files --other --ignored --exclude-standard

Finally, follow the rest of this GitHub guide (starting at step 6) which includes important warnings/information about the commands below.

git push origin --force --all
git push origin --force --tags
git for-each-ref --format="delete %(refname)" refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now

Other devs that pull from now-modified remote repo should make a backup and then:

#fetch modified remote

git fetch --all

#"Pull" changes WITHOUT deleting newly-ignored files from working directory
#This will overwrite local tracked files with remote - ensure any local modifications are backed-up/stashed
#Switching branches after this procedure WILL LOSE all newly-gitignored files in working directory because they are no longer tracked when switching branches

git reset FETCH_HEAD

Footnotes

¹ Because /.git/info/exclude can be applied to all historical commits using the instructions above, details about getting a .gitignore file into the historical commit(s) that need it are perhaps beyond the scope of this answer. I wanted a proper .gitignore to be in the root commit, as if it was the first thing I did. Others may not care, since /.git/info/exclude can accomplish the same thing regardless of where the .gitignore exists in the commit history, and clearly re-writing history is a very touchy subject, even when aware of the ramifications.

FWIW, potential methods may include git rebase or a git filter-branch that copies an external .gitignore into each commit, like the answers to this question

² Enforcing git ignore behavior after-the-fact by committing the results of a standalone git rm --cached command may result in newly-ignored file deletion in future pulls from the force-pushed remote. The --prune-empty flag in the following git filter-branch command avoids this problem by automatically removing the previous "delete all ignored files" index-only commit. Re-writing git history also changes commit hashes, which will wreak havoc on future pulls from public/shared/collaborative repos. Please understand the ramifications fully before doing this to such a repo. This GitHub guide specifies the following:

Tell your collaborators to rebase, not merge, any branches they created off of your old (tainted) repository history. One merge commit could reintroduce some or all of the tainted history that you just went to the trouble of purging.

Alternative solutions that do not affect the remote repo are git update-index --assume-unchanged </path/file> or git update-index --skip-worktree <file>, examples of which can be found here.
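For the second of these alternatives, a minimal sketch on a made-up repository might look like the following. The file name local.conf is hypothetical; the S in the git ls-files -v output marks an entry with the skip-worktree bit set:

```shell
#!/bin/sh
# Hide local edits to a tracked file without rewriting any history.
# The repository and file name are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "you@example.com"
git config user.name "Example"

echo 'setting=default' > local.conf
git add . && git commit -q -m "first"

# Tell Git to stop looking at the work-tree copy of this file.
git update-index --skip-worktree local.conf
echo 'setting=mine' > local.conf

status=$(git status --porcelain)      # the local edit is not reported
flags=$(git ls-files -v local.conf)   # shows the skip-worktree marker
printf 'status: [%s]\nflags:  %s\n' "$status" "$flags"

# Undo: the modification becomes visible to git status again.
git update-index --no-skip-worktree local.conf
```

The bit lives in the local index only, so each collaborator has to set it in their own clone.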

Tags: git, gitignore, rebase

Carl Dong
