git filter-branch: filter directories by excluding, not including?

Question

Suppose I have this structure in repo:

repo/
  dir1/
  dir2/
  dir3/
  dir4/
  dir5/
  ...

Now I want to keep all directories, except dir1 and dir2.

I can use this command to keep specified directories.

git filter-branch --index-filter 'git rm --cached -qr --ignore-unmatch -- . && git reset -q $GIT_COMMIT -- dir3 dir4 dir5 ... dirN' --prune-empty -- --all

Now if I have many directories, it would be simpler to exclude directories I don't need instead of specifying the ones I need. How could I do that?

torek · Accepted Answer

Change your index-filter to specifically remove the unwanted paths, and not do anything to any of the others, i.e., the --index-filter part becomes:

--index-filter 'git rm --cached -qr --ignore-unmatch dir1 dir2'

The index-filter you use now consists of two separate commands joined by &&. These commands are, in text form:

1. Remove everything. (The --ignore-unmatch is pointless here since you use . to specify "everything that exists", and "everything that exists" obviously exists.)
2. Then, put back dir3, dir4, etc., from the current commit.

Since you just want to remove (recursively) "everything in dir1" and "everything in dir2", specify those. Keep the --ignore-unmatch if there may be commits in which no dir1 and/or dir2 files exist. After removing what you want gone, you don't need to put anything back: the index—the temporary index that git filter-branch uses to achieve the filtering (see below)—now has the correct set of files in it.
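To see the exclusion filter end to end, here is a minimal sketch that builds a throwaway repository and strips dir1 and dir2 from its history. The temporary directory, the demo identity, the file names, and the commit messages are all made up for illustration. FILTER_BRANCH_SQUELCH_WARNING only silences the advisory pause that newer Git versions print before filter-branch runs.

```shell
#!/bin/sh
# Sketch only: a scratch repo with dir1..dir4, one commit per directory.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the advisory pause in newer Git

demo=$(mktemp -d)
cd "$demo"
git init -q .
for d in dir1 dir2 dir3 dir4; do
    mkdir "$d"
    echo "content of $d" > "$d/file.txt"
    git add "$d"
    git commit -q -m "add $d"
done

# Exclude instead of include: remove only the unwanted paths from each
# commit's (temporary) index and leave everything else alone.
git filter-branch --index-filter \
    'git rm --cached -qr --ignore-unmatch -- dir1 dir2' \
    --prune-empty -- --all >/dev/null 2>&1

# List what is left at the top level of the rewritten HEAD.
remaining=$(git ls-tree --name-only HEAD | xargs)
echo "top-level entries: $remaining"
```

In this scratch repository only dir3 and dir4 remain in the rewritten HEAD. The --ignore-unmatch flag matters here because the early commits do not contain both directories yet.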

Sidebar: what is this index anyway?

When you make new commits, Git doesn't actually use the files in your work-tree. They're not important here.

Git has, instead, a thing—implemented mainly as a file named .git/index, really—that Git calls, variously, the index, or the staging area, or (rarely these days) the cache. This index holds a copy of every file taken out of the current commit, initially. You can then use git add or git rm to update the files that are in the index, or take files completely out of the index.

You can think of the index as the proposed next commit. When you run git commit, Git packages up the files that are in the index at that point and freezes them into a new, permanent,¹ read-only commit. The files you see and work with, in your work-tree, are only there for you, not really for Git. That's why, whenever you modify a work-tree file and want the change to go into the next commit, you have to git add all the time: git add tells Git to take the work-tree copy and use it to overwrite the index copy, so that the next commit will have this version.
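The "proposed next commit" idea can be checked by hand. The sketch below assumes a scratch repository and a hypothetical file named notes.txt. It uses git show :notes.txt to print the index copy of the file, which stays at the old content until git add overwrites it.

```shell
#!/bin/sh
# Sketch only: watch the index copy lag behind the work-tree copy.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q .
echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "first"

# Change the work-tree file only; the index still holds the old copy.
echo "version 2" > notes.txt
staged_before=$(git show :notes.txt)   # ":path" names the index copy
echo "index before git add: $staged_before"

# git add overwrites the index copy with the work-tree copy.
git add notes.txt
staged_after=$(git show :notes.txt)
echo "index after git add:  $staged_after"

# The commit is built from the index, so it gets the staged version.
git commit -q -m "second"
committed=$(git show HEAD:notes.txt)
echo "committed: $committed"
```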

When you use git filter-branch, you have a bunch of options. The very slowest one, --tree-filter, takes each commit, copies it into a temporary index (because Git always needs an index for this stuff, even if it's not the regular main one) and then extracts all the files from that temporary index into a temporary tree. You can then modify the files in the temporary tree, using your --tree-filter code. Git then re-reads the temporary tree, builds a new (but still temporary) index from that, and uses that to make a new commit.

All of this copying is very slow. So filter-branch gives you --index-filter: this time, Git copies the commit to a temporary index, then lets you modify the temporary index directly. The git rm --cached command modifies the index—or in this case, the temporary index instead—by removing files from it. Then filter-branch makes a new commit from the temporary index. This skips the slowest parts of --tree-filter.

You still wind up copying every commit in the repository to some new-and-improved one, but by doing it only in the temporary index that filter-branch provides, it goes a lot faster.
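As a rough comparison, the sketch below runs the same exclusion once with --tree-filter and once with --index-filter on two copies of a scratch repository, then compares the resulting top-level tree hashes. The directory names by-tree and by-index, the demo identity, and the file contents are assumptions made for the demonstration. It says nothing about timing on a real repository; it only shows that the two filters are meant to arrive at the same content.

```shell
#!/bin/sh
# Sketch only: same removal done two ways, on two copies of one repo.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the advisory pause in newer Git

base=$(mktemp -d)
git init -q "$base/src"
(
    cd "$base/src"
    for d in dir1 dir2 dir3; do
        mkdir "$d"
        echo "$d" > "$d/file.txt"
        git add "$d"
        git commit -q -m "add $d"
    done
)
cp -R "$base/src" "$base/by-tree"
cp -R "$base/src" "$base/by-index"

# Slow path: every commit is checked out into a temporary tree,
# and the filter edits real files there.
(cd "$base/by-tree" && git filter-branch \
    --tree-filter 'rm -rf dir1 dir2' \
    --prune-empty -- --all >/dev/null 2>&1)

# Fast path: the filter edits the temporary index directly.
(cd "$base/by-index" && git filter-branch \
    --index-filter 'git rm --cached -qr --ignore-unmatch -- dir1 dir2' \
    --prune-empty -- --all >/dev/null 2>&1)

tree_a=$(git -C "$base/by-tree" rev-parse 'HEAD^{tree}')
tree_b=$(git -C "$base/by-index" rev-parse 'HEAD^{tree}')
echo "tree-filter result:  $tree_a"
echo "index-filter result: $tree_b"
```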

¹Commits are really only semi-permanent. They last as long as Git can find them. For much more about this, see Think Like (a) Git. When you use filter-branch, you copy some commit(s) to some new-and-improved ones, and have your Git try to forget the originals. Eventually, your Git probably does forget them.
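One way to see that the originals are not forgotten right away: git filter-branch keeps backup references under refs/original/, so the old commits stay findable until those references (and any reflog entries) are gone. The sketch below assumes a scratch repository with made-up directory names and only inspects the state after the rewrite.

```shell
#!/bin/sh
# Sketch only: the pre-rewrite commit is still in the repository afterwards.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the advisory pause in newer Git

demo=$(mktemp -d)
cd "$demo"
git init -q .
for d in dir1 dir3; do
    mkdir "$d"
    echo "$d" > "$d/file.txt"
    git add "$d"
    git commit -q -m "add $d"
done

old_head=$(git rev-parse HEAD)
git filter-branch --index-filter \
    'git rm --cached -qr --ignore-unmatch -- dir1' \
    --prune-empty -- --all >/dev/null 2>&1
new_head=$(git rev-parse HEAD)

# filter-branch saved the pre-rewrite branch tip under refs/original/.
backup_ref=$(git for-each-ref --format='%(refname)' refs/original | head -n 1)
# The old commit object can still be looked up by its hash.
old_type=$(git cat-file -t "$old_head")

echo "old HEAD: $old_head ($old_type)"
echo "new HEAD: $new_head"
echo "backup:   $backup_ref"
```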
