
New repo with copied history of only currently tracked files

Tags:

git

Our current repo has tens of thousands of commits and a fresh clone transfers nearly a gig of data (there are lots of jar files in the history that have since been deleted). We'd like to cut this size down by making a new repo that keeps the full history for just the files that are currently active in the repo, or possibly just modify the current repo to clear the deleted file history. But I'm not sure how to do this in a practical manner.

I've tried the script in Remove deleted files from git history:

    for del in `cat deleted.txt`
    do
        git filter-branch --index-filter "git rm --cached --ignore-unmatch $del" --prune-empty -- --all
        # The following seems to be necessary every time
        # because otherwise git won't overwrite refs/original
        git reset --hard
        git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
        git reflog expire --expire=now --all
        git gc --aggressive --prune=now
    done;

But given that we have tens of thousands of deleted files in the history and tens of thousands of commits, running the script would take an eternity. I started running this for just ONE deleted file 2 hours ago and the filter-branch command is still running. It's going through each of the 40,000+ commits one at a time, and this is on a new MacBook Pro with an SSD.

I've also read the page https://help.github.com/articles/remove-sensitive-data but this only works for removing single files.

Has anyone been able to do this? I really want to preserve the history of currently tracked files; I'm not sure the space savings would be worth creating a new repo if we can't keep the history.

Brent Sowers asked Jul 27 '13 19:07



2 Answers

Delete everything and restore what you want

Rather than delete this-list-of-files one at a time, do the almost-opposite: delete everything and just restore the files you want to keep.

Like so:

    # for unix
    $ git checkout master
    $ git ls-files > keep-these.txt
    $ git filter-branch --force --index-filter \
      "git rm  --ignore-unmatch --cached -qr . ; \
      cat $PWD/keep-these.txt | tr '\n' '\0' | xargs -d '\0' git reset -q \$GIT_COMMIT --" \
      --prune-empty --tag-name-filter cat -- --all

    # for macOS
    $ git checkout master
    $ git ls-files > keep-these.txt
    $ git filter-branch --force --index-filter \
      "git rm  --ignore-unmatch --cached -qr . ; \
      cat $PWD/keep-these.txt | tr '\n' '\0' | xargs -0 git reset -q \$GIT_COMMIT --" \
      --prune-empty --tag-name-filter cat -- --all

Because this rewrites the history in a single pass rather than once per deleted file, it should be much faster to execute.
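The Unix and macOS variants above differ only in how xargs is told to split on NUL bytes. Here is a minimal sketch of a single variant, assuming your xargs supports -0 (both GNU and BSD xargs do): letting git ls-files -z write the list NUL-terminated removes the tr step. The function name prune_to_tracked and the temporary file are made up for this example and are not part of the original commands. FILTER_BRANCH_SQUELCH_WARNING only matters on Git 2.24 and later, where filter-branch otherwise prints an advisory and pauses. Try it on a scratch clone first.

```shell
# Hypothetical helper: rewrite all refs so that only the paths tracked at
# HEAD remain in the history. Run it from the top level of a scratch clone.
prune_to_tracked() {
    keep_list="$(mktemp)" || return 1
    # -z writes NUL-terminated paths, so names containing spaces stay intact
    git ls-files -z > "$keep_list"
    # $keep_list is expanded now; \$GIT_COMMIT is left for filter-branch
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --index-filter \
        "git rm --ignore-unmatch --cached -qr . ; \
         xargs -0 git reset -q \$GIT_COMMIT -- < '$keep_list'" \
        --prune-empty --tag-name-filter cat -- --all
    status=$?
    rm -f "$keep_list"
    return $status
}
```

Calling prune_to_tracked replaces the ls-files and filter-branch steps; the cleanup steps still have to be run afterwards.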

Cleanup steps

Once the whole process has finished, clean up:

    $ rm -rf .git/refs/original/
    $ git reflog expire --expire=now --all
    $ git gc --prune=now

    # optional extra gc. Slow and may not further-reduce the repo size
    $ git gc --aggressive --prune=now

Comparing the repository size before and after should show a sizable reduction. Only commits that touch the kept files, plus merge commits (even if empty, because that's how --prune-empty works), will remain in the history.
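One way to make that before/after comparison, sketched under the assumption that the packed size reported by git count-objects is the number you care about. The helper name repo_pack_size is made up for this example.

```shell
# Hypothetical helper: print the on-disk size of the packfiles of the
# repository in the current directory, in human-readable units.
repo_pack_size() {
    git count-objects -vH | sed -n 's/^size-pack: //p'
}

# Possible usage: record the value before the rewrite, then compare
# once the cleanup steps have run.
# before="$(repo_pack_size)"
# ... filter-branch and cleanup ...
# echo "pack size: $before -> $(repo_pack_size)"
```

Loose objects are not included in size-pack, so the figure is most meaningful right after a git gc.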

$GIT_COMMIT?

The use of $GIT_COMMIT seems to have caused some confusion. From the git filter-branch documentation (emphasis added):

The argument is always evaluated in the shell context using the eval command (with the notable exception of the commit filter, for technical reasons). Prior to that, the $GIT_COMMIT environment variable will be set to contain the id of the commit being rewritten.

That means git filter-branch provides the variable at run time; you don't set it beforehand. If there's any doubt, this can be demonstrated with a no-op filter-branch command:

    $ git filter-branch --index-filter "echo current commit is \$GIT_COMMIT"
    Rewrite d832800a85be9ef4ee6fda2fe4b3b6715c8bb860 (1/xxxxx)current commit is d832800a85be9ef4ee6fda2fe4b3b6715c8bb860
    Rewrite cd86555549ac17aeaa28abecaf450b49ce5ae663 (2/xxxxx)current commit is cd86555549ac17aeaa28abecaf450b49ce5ae663
    ...
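The backslash in \$GIT_COMMIT is what delays the expansion. The following plain-shell sketch needs no repository and shows the difference; the variable values are stand-ins chosen for the example, and eval plays the role of filter-branch evaluating the filter string. By the same logic, $PWD is left unescaped in the filter so that it expands immediately, to the directory where keep-these.txt was written.

```shell
# Stand-in for the variable that git filter-branch sets for each commit
GIT_COMMIT="set-too-early"

# Unescaped: the shell expands the variable while building the string
early="echo commit is $GIT_COMMIT"
# Escaped: the string keeps a literal $GIT_COMMIT until it is evaluated
late="echo commit is \$GIT_COMMIT"

# Simulate filter-branch setting the variable just before running the filter
GIT_COMMIT="d832800"

early_out="$(eval "$early")"
late_out="$(eval "$late")"
echo "$early_out"   # commit is set-too-early
echo "$late_out"    # commit is d832800
```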
AD7six answered Oct 12 '22 00:10


Based on AD7six's answer, with the history of renamed files preserved. (You can skip the preliminary optional section.)

Optional

remove all remotes:

git remote | while read -r line; do (git remote rm "$line"); done 

remove all tags:

git tag | xargs git tag -d 

remove all other branches:

git branch | grep -v \* | xargs git branch -D 

remove all stashes:

git stash clear 

remove all submodules configuration and cache:

    git config --local -l | grep submodule | sed -e 's/^\(submodule\.[^.]*\)\(.*\)/\1/g' | while read -r line; do (git config --local --remove-section "$line"); done
    rm -rf .git/modules/

Pruning untracked files history, keeping tracked files history & renames

    git ls-files | sed -e 's/^/"/g' -e 's/$/"/g' > keep-these.txt
    git ls-files | while read -r line; do
        (git log --follow --raw --diff-filter=R --pretty=format:%H "$line" | while true; do
            if ! read hash; then break; fi
            IFS=$'\t' read mode_etc oldname newname
            read blankline
            echo $oldname
        done)
    done | sed -e 's/^/"/g' -e 's/$/"/g' >> keep-these.txt
    git filter-branch --force --index-filter "git rm --ignore-unmatch --cached -qr .; cat \"$PWD/keep-these.txt\" | xargs git reset -q \$GIT_COMMIT --" --prune-empty --tag-name-filter cat -- --all
    rm keep-these.txt
    rm -rf .git/refs/original/
    git reflog expire --expire=now --all
    git gc --prune=now
  • The first two commands list the tracked files and the old names of tracked files, wrapping each path in quotes to preserve paths with spaces.
  • The third command rewrites the commits so that they contain only those files.
  • The remaining commands clean up the history.

Optional (not recommended)

repack (from the-woes-of-git-gc-aggressive):

git repack -a -d --depth=250 --window=250 
Cœur answered Oct 12 '22 02:10