New repo with copied history of only currently tracked files

Tags:

git

Our current repo has tens of thousands of commits, and a fresh clone transfers nearly a gig of data (the history contains lots of jar files that have since been deleted). We'd like to cut this size down by making a new repo that keeps the full history for just the files that are currently active in the repo, or possibly just modify the current repo to clear the deleted file history. But I'm not sure how to do this in a practical manner.

I've tried the script from "Remove deleted files from git history":

for del in `cat deleted.txt`
do
    git filter-branch --index-filter "git rm --cached --ignore-unmatch $del" --prune-empty -- --all
    # The following seems to be necessary every time
    # because otherwise git won't overwrite refs/original
    git reset --hard
    git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
    git reflog expire --expire=now --all
    git gc --aggressive --prune=now
done;

But given that we have tens of thousands of deleted files in the history and tens of thousands of commits, running the script would take an eternity. I started running this for just ONE deleted file 2 hours ago and the filter-branch command is still running. It's going through each of the 40,000+ commits one at a time, and this is on a new MacBook Pro with an SSD.

I've also read the page https://help.github.com/articles/remove-sensitive-data but this only works for removing single files.

Has anyone been able to do this? I really want to preserve the history of currently tracked files. I'm not sure the space savings would be worth creating a new repo if we can't keep the history.


asked Jul 27 '13 19:07

Brent Sowers

2 Answers

Delete everything and restore what you want

Rather than delete this-list-of-files one at a time, do the almost-opposite: delete everything and just restore the files you want to keep.

Like so:

# for unix
$ git checkout master
$ git ls-files > keep-these.txt
$ git filter-branch --force --index-filter \
  "git rm --ignore-unmatch --cached -qr . ; \
  cat $PWD/keep-these.txt | tr '\n' '\0' | xargs -d '\0' git reset -q \$GIT_COMMIT --" \
  --prune-empty --tag-name-filter cat -- --all

# for macOS
$ git checkout master
$ git ls-files > keep-these.txt
$ git filter-branch --force --index-filter \
  "git rm --ignore-unmatch --cached -qr . ; \
  cat $PWD/keep-these.txt | tr '\n' '\0' | xargs -0 git reset -q \$GIT_COMMIT --" \
  --prune-empty --tag-name-filter cat -- --all
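The tr/xargs stage in both variants is there so that each line of keep-these.txt reaches git reset as exactly one argument, even when a path contains spaces. A minimal sketch of the difference, using a made-up two-line list (the file names are hypothetical, not from the question's repo):

```shell
# Hypothetical demo list; the second path contains a space.
list=$(mktemp)
printf '%s\n' 'src/Main.java' 'docs/release notes.txt' > "$list"

# Default xargs parsing splits on every blank, so the second path is broken in two.
split_count=$(xargs -n 1 printf '%s\n' < "$list" | wc -l | tr -d ' ')

# NUL-separated input: every line arrives as exactly one argument.
# The -0 flag is accepted by both GNU and BSD (macOS) xargs; -d is GNU-only.
nul_count=$(tr '\n' '\0' < "$list" | xargs -0 -n 1 printf '%s\n' | wc -l | tr -d ' ')

echo "whitespace splitting: $split_count arguments"   # 3
echo "NUL splitting: $nul_count arguments"            # 2
rm -f "$list"
```

Since -0 works with both implementations, the macOS form should also run unchanged on Linux.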

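This should be much faster to execute, because the whole history is rewritten in a single filter-branch pass rather than one pass per deleted file.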
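After the rewrite has finished, one way to check the result is to list every path that still appears somewhere in history and subtract the keep list. This is only a sketch: paths_outside_keep_list is a hypothetical helper name, it assumes keep-these.txt was written by git ls-files as above, and it ignores paths that git prints in quoted form (unusual characters).

```shell
# Print every path that still appears in any commit but is not in the keep list.
# Run from inside the rewritten repository; $1 is the path to keep-these.txt.
paths_outside_keep_list() {
    keep_file=$1
    all_paths=$(mktemp)
    # --name-only lists the files touched by each commit; the empty format
    # suppresses the commit headers, and sed drops the blank separator lines.
    git log --all --pretty=format: --name-only | sed '/^$/d' | sort -u > "$all_paths"
    # comm -23 keeps the lines that occur only in the first (history) list.
    sort -u "$keep_file" | comm -23 "$all_paths" -
    rm -f "$all_paths"
}
```

Calling `paths_outside_keep_list keep-these.txt` should print nothing if only the kept files survived the rewrite.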

Cleanup steps

Once the whole process has finished, clean up:

$ rm -rf .git/refs/original/
$ git reflog expire --expire=now --all
$ git gc --prune=now

# optional extra gc. Slow and may not further-reduce the repo size
$ git gc --aggressive --prune=now

Comparing the repository size before and after should show a sizable reduction. Only commits that touch the kept files will remain in the history, plus merge commits, even empty ones (because that's how --prune-empty works).
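To put a number on the before/after comparison, something like the following could be used. It is a sketch: repo_object_kib is a hypothetical helper name, and it only adds up the "size" (loose objects) and "size-pack" (packfiles) fields that git count-objects -v reports in KiB.

```shell
# Print the size of a repository's object store in KiB.
# $1 is the path to the repository.
repo_object_kib() {
    git -C "$1" count-objects -v |
        awk '$1 == "size:" || $1 == "size-pack:" { total += $2 } END { print total + 0 }'
}
```

Run `repo_object_kib .` once before the rewrite and once after the cleanup steps, then compare the two numbers. The size only drops once the old objects have actually been pruned, so measuring before git gc can be misleading.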

$GIT_COMMIT?

The use of $GIT_COMMIT seems to have caused some confusion. From the git filter-branch documentation:

The argument is always evaluated in the shell context using the eval command (with the notable exception of the commit filter, for technical reasons). Prior to that, the $GIT_COMMIT environment variable will be set to contain the id of the commit being rewritten.

That means git filter-branch will provide the variable at run time; it's not provided by you beforehand. If there's any doubt, this can be demonstrated using this no-op filter-branch command:

$ git filter-branch --index-filter "echo current commit is \$GIT_COMMIT"
Rewrite d832800a85be9ef4ee6fda2fe4b3b6715c8bb860 (1/xxxxx)current commit is d832800a85be9ef4ee6fda2fe4b3b6715c8bb860
Rewrite cd86555549ac17aeaa28abecaf450b49ce5ae663 (2/xxxxx)current commit is cd86555549ac17aeaa28abecaf450b49ce5ae663
...
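The backslash in \$GIT_COMMIT is what defers expansion until the filter is evaluated. The same effect can be seen without git at all. In this sketch, run_filter is a hypothetical stand-in that mimics what the documentation describes (set the variable, then eval the filter string); it is not how filter-branch is actually implemented.

```shell
# Hypothetical stand-in for filter-branch: set GIT_COMMIT, then eval the filter.
run_filter() {
    GIT_COMMIT=$1
    export GIT_COMMIT
    eval "$2"
}

# In the calling shell the variable has no useful value yet.
GIT_COMMIT=''

# Escaped: the literal text $GIT_COMMIT survives into the filter string.
escaped="echo current commit is \$GIT_COMMIT"
# Unescaped: the calling shell expands the (empty) variable right here.
unescaped="echo current commit is $GIT_COMMIT"

run_filter d832800 "$escaped"     # prints: current commit is d832800
run_filter d832800 "$unescaped"   # prints: current commit is
```

With the unescaped form, git reset would receive an empty commit argument for every commit, which is why the dollar sign has to be escaped inside the double-quoted filter.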

answered Oct 12 '22 00:10

AD7six

Based on AD7six's answer, with the history of renamed files preserved. (You can skip the optional preliminary section.)

Optional

remove all remotes:

git remote | while read -r line; do (git remote rm "$line"); done

remove all tags:

git tag | xargs git tag -d

remove all other branches:

git branch | grep -v \* | xargs git branch -D

remove all stashes:

git stash clear

remove all submodules configuration and cache:

git config --local -l | grep submodule | sed -e 's/^\(submodule\.[^.]*\)\(.*\)/\1/g' | while read -r line; do (git config --local --remove-section "$line"); done
rm -rf .git/modules/

Pruning untracked files history, keeping tracked files history & renames

git ls-files | sed -e 's/^/"/g' -e 's/$/"/g' > keep-these.txt
git ls-files | while read -r line; do
    (git log --follow --raw --diff-filter=R --pretty=format:%H "$line" | while true; do
        if ! read hash; then break; fi
        IFS=$'\t' read mode_etc oldname newname
        read blankline
        echo $oldname
    done)
done | sed -e 's/^/"/g' -e 's/$/"/g' >> keep-these.txt
git filter-branch --force --index-filter "git rm --ignore-unmatch --cached -qr .; cat \"$PWD/keep-these.txt\" | xargs git reset -q \$GIT_COMMIT --" --prune-empty --tag-name-filter cat -- --all
rm keep-these.txt
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now

The first two commands list the tracked files and their earlier names, wrapping each path in quotes so that paths containing spaces survive xargs.
The third command rewrites every commit so that it contains only those files.
The remaining commands clean up the leftover refs and unreachable objects.
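The rename lookup in the second command can be tried on a single file before running it over the whole tree. The following is a sketch of the same idea using --name-status instead of --raw, not the answer's exact command: old_names_of is a hypothetical helper name, and paths that git prints in quoted form (unusual characters) are not handled.

```shell
# Print every earlier name of one tracked file, following renames.
# With --name-status, a rename line looks like: R<score><TAB>old<TAB>new
old_names_of() {
    git log --follow --diff-filter=R --name-status --format= -- "$1" |
        awk -F '\t' '$1 ~ /^R/ { print $2 }'
}
```

Running `old_names_of path/to/file` inside the repository prints one old path per detected rename, which is the list that gets appended to keep-these.txt. Note that --follow relies on git's rename detection, so a file that was renamed and heavily edited in the same commit may not be traced.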

Optional (not recommended)

repack (from the-woes-of-git-gc-aggressive):

git repack -a -d --depth=250 --window=250

answered Oct 12 '22 02:10

Cœur
