Our current repo has tens of thousands of commits, and a fresh clone transfers nearly a gig of data (there are lots of jar files in the history that have since been deleted). We'd like to cut this size down by making a new repo that keeps the full history for just the files that are currently active in the repo, or possibly just modify the current repo to clear the deleted file history. But I'm not sure how to do this in a practical manner.
I've tried the script in "Remove deleted files from git history":
    for del in `cat deleted.txt`
    do
        git filter-branch --index-filter "git rm --cached --ignore-unmatch $del" --prune-empty -- --all
        # The following seems to be necessary every time
        # because otherwise git won't overwrite refs/original
        git reset --hard
        git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
        git reflog expire --expire=now --all
        git gc --aggressive --prune=now
    done;
But given that we have tens of thousands of deleted files in the history and tens of thousands of commits, running the script would take an eternity. I started running this for just ONE deleted file 2 hours ago and the filter-branch command is still running. It's going through each of the 40,000+ commits one at a time, and this is on a new MacBook Pro with an SSD.
I've also read the page https://help.github.com/articles/remove-sensitive-data but this only works for removing single files.
Has anyone been able to do this? I really want to preserve the history of currently tracked files; I'm not sure the space savings would be worth creating a new repo if we can't keep the history.
Rather than delete this-list-of-files one at a time, do the almost-opposite: delete everything and just restore the files you want to keep.
Like so:
    # for unix
    $ git checkout master
    $ git ls-files > keep-these.txt
    $ git filter-branch --force --index-filter \
      "git rm --ignore-unmatch --cached -qr . ; \
      cat $PWD/keep-these.txt | tr '\n' '\0' | xargs -d '\0' git reset -q \$GIT_COMMIT --" \
      --prune-empty --tag-name-filter cat -- --all
    # for macOS
    $ git checkout master
    $ git ls-files > keep-these.txt
    $ git filter-branch --force --index-filter \
      "git rm --ignore-unmatch --cached -qr . ; \
      cat $PWD/keep-these.txt | tr '\n' '\0' | xargs -0 git reset -q \$GIT_COMMIT --" \
      --prune-empty --tag-name-filter cat -- --all
This makes a single pass over the history instead of one pass per deleted file, so it should be much faster to execute.
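To try the keep-list approach end to end before running it on a real repository, here is a minimal sketch on a throwaway repo. The file names (a.txt, big.jar), the temporary paths, and the demo identity are made up for illustration. FILTER_BRANCH_SQUELCH_WARNING is only there to silence the warning that newer Git versions print before filter-branch starts, and `xargs -0` is used because it is accepted by both GNU and BSD xargs.

```shell
#!/bin/sh
set -e

# Hypothetical throwaway repo; every name below is illustrative only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Commit 1: one file we keep, one jar that will later be deleted.
echo "v1" > a.txt
echo "pretend binary" > big.jar
git add a.txt big.jar
git commit -q -m "add a.txt and big.jar"

# Commit 2: delete the jar. Commit 3: change the kept file.
git rm -q big.jar
git commit -q -m "remove big.jar"
echo "v2" > a.txt
git commit -q -am "update a.txt"

# Keep list = files tracked right now (stored outside the work tree).
keep=$(mktemp)
git ls-files > "$keep"

# One pass over history: empty the index, then restore only kept paths.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --index-filter \
  "git rm --ignore-unmatch --cached -qr . ; \
   tr '\n' '\0' < '$keep' | xargs -0 git reset -q \$GIT_COMMIT --" \
  --prune-empty -- --all > /dev/null

# Commits on the rewritten branch that still mention the jar.
jar_commits=$(git rev-list HEAD -- big.jar | wc -l | tr -d ' ')
# Commits that touch the kept file (its history should survive).
kept_commits=$(git rev-list HEAD -- a.txt | wc -l | tr -d ' ')
echo "commits touching big.jar: $jar_commits"
echo "commits touching a.txt: $kept_commits"
```

The check uses HEAD rather than --all on purpose: until refs/original/ is deleted, --all still reaches the old, unrewritten commits.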
Once the whole process has finished, clean up:
    $ rm -rf .git/refs/original/
    $ git reflog expire --expire=now --all
    $ git gc --prune=now

    # optional extra gc. Slow and may not further-reduce the repo size
    $ git gc --aggressive --prune=now
Comparing the repository size before and after should show a sizable reduction. Of course, only commits that touch the kept files, plus merge commits (even if empty, because that's how --prune-empty works), will be in the history.
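One way to put a number on the before/after comparison is `git count-objects -v`, which reports loose and packed object sizes in KiB. The helper name `repo_kib` below is made up for this sketch; it simply adds the `size` and `size-pack` lines together.

```shell
#!/bin/sh
# Sum loose ("size") and packed ("size-pack") object storage, in KiB,
# for the repository in the current directory.
# repo_kib is a hypothetical helper name, not a git command.
repo_kib() {
  git count-objects -v |
    awk '/^size(-pack)?:/ { total += $2 } END { print total + 0 }'
}

# Print the figure only when we are actually inside a repository.
if git rev-parse --git-dir > /dev/null 2>&1; then
  echo "object storage: $(repo_kib) KiB"
fi
```

Run it once before the rewrite and once after the cleanup step; the difference only shows up after the reflog expire and gc, since the old objects stay on disk until then. For a quick human-readable view, `git count-objects -vH` prints the same fields with units.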
The use of $GIT_COMMIT seems to have caused some confusion. From the git filter-branch documentation (emphasis added):
The argument is always evaluated in the shell context using the eval command (with the notable exception of the commit filter, for technical reasons). Prior to that, the $GIT_COMMIT environment variable will be set to contain the id of the commit being rewritten.
That means git filter-branch will provide the variable at run time; you don't provide it beforehand. If there's any doubt, this can be demonstrated using this no-op filter-branch command:
    $ git filter-branch --index-filter "echo current commit is \$GIT_COMMIT"
    Rewrite d832800a85be9ef4ee6fda2fe4b3b6715c8bb860 (1/xxxxx)current commit is d832800a85be9ef4ee6fda2fe4b3b6715c8bb860
    Rewrite cd86555549ac17aeaa28abecaf450b49ce5ae663 (2/xxxxx)current commit is cd86555549ac17aeaa28abecaf450b49ce5ae663
    ...
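The same quoting behaviour can be seen without git at all. In this small sketch the commit id `deadbeef` is a made-up placeholder and the `eval` call stands in for what filter-branch does once per commit; the point is only that `\$GIT_COMMIT` survives as literal text in the filter string while `$PWD` is expanded immediately by your own shell.

```shell
#!/bin/sh
# Make sure nothing from the calling environment leaks in.
unset GIT_COMMIT

# \$GIT_COMMIT stays literal; $PWD is expanded right here, by this shell.
filter="echo commit=\$GIT_COMMIT dir=$PWD"
printf '%s\n' "$filter"

# Later, filter-branch sets GIT_COMMIT and evals the string per commit.
# Here one such step is faked with a made-up id.
GIT_COMMIT=deadbeef
result=$(eval "$filter")
printf '%s\n' "$result"
```

This is also why the keep-list path is written as $PWD/keep-these.txt without a backslash: it has to be resolved up front, because the filter itself runs in a temporary directory.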
Based on AD7six's answer, with the history of renamed files preserved. (You can skip the preliminary optional section.)
remove all remotes:
git remote | while read -r line; do (git remote rm "$line"); done
remove all tags:
git tag | xargs git tag -d
remove all other branches:
git branch | grep -v \* | xargs git branch -D
remove all stashes:
git stash clear
remove all submodules configuration and cache:
git config --local -l | grep submodule | sed -e 's/^\(submodule\.[^.]*\)\(.*\)/\1/g' | while read -r line; do (git config --local --remove-section "$line"); done
rm -rf .git/modules/
git ls-files | sed -e 's/^/"/g' -e 's/$/"/g' > keep-these.txt
git ls-files | while read -r line; do
  (git log --follow --raw --diff-filter=R --pretty=format:%H "$line" | while true; do
    if ! read hash; then break; fi
    IFS=$'\t' read mode_etc oldname newname
    read blankline
    echo $oldname
  done)
done | sed -e 's/^/"/g' -e 's/$/"/g' >> keep-these.txt
git filter-branch --force --index-filter "git rm --ignore-unmatch --cached -qr .; cat \"$PWD/keep-these.txt\" | xargs git reset -q \$GIT_COMMIT --" --prune-empty --tag-name-filter cat -- --all
rm keep-these.txt
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
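After a rewrite like this, it can be reassuring to list which paths still appear anywhere in the rewritten history. The helper name `paths_in_history` below is hypothetical, and this is only a rough check: `git log --name-only` shows nothing for merge commits by default, and paths with unusual characters may be printed in quoted form.

```shell
#!/bin/sh
# List every path that shows up in any commit reachable from HEAD.
# paths_in_history is a hypothetical helper name, not a git command.
paths_in_history() {
  git log --pretty=format: --name-only HEAD | sed '/^$/d' | sort -u
}

# Print the list only when run inside a repository that has commits.
if git rev-parse --verify -q HEAD > /dev/null 2>&1; then
  paths_in_history
fi
```

Every path printed should be either a currently tracked file or an older name of one; anything else suggests the keep list picked up an entry it should not have.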
repack (from the-woes-of-git-gc-aggressive):
git repack -a -d --depth=250 --window=250