I have a codebase that (until now) used git to store its dependencies. The repository itself is available here (warning: it's HUGE). Needless to say, I need to remove the dependencies from the repository history in order to cut it down to a reasonable size.
I started by using David Underhill's instructions to remove the lib directory from the history. Even after doing this, however, the repository is still over 300M. Issuing git prune and git repack helps, but it's still over 180M.
In an attempt to find any bloated blobs, I issued
git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head
with these results:
105526b5d3d398b9989d88c2f9fc2d1dc96a85b8 blob 35685609 33600527 31978828
d296935e6ac5f3f58b50c789394c9769116e9c34 blob 35658016 33593241 112485744
50636f931180a32764edadd854968a971a083f8a blob 28360290 25897864 233390
b9e4dd37428e879a258f297b7f5bcfb9ba869695 blob 13108002 11640713 66661788
08d2720b2414aa07ce419b17d5f80c333c7313b7 blob 12551621 11124009 89231035
6197a478a461275a0396f20c28487e9ae619a5f9 blob 11975135 11058259 148211988 1 50636f931180a32764edadd854968a971a083f8a
549eb0c73776fd0ede27a2fcb03366f76f45a13c blob 9136086 8166649 166451273
5bc0a0f04a7004bc16cfab1c091c6b369fb74049 blob 9072616 8270262 80951514
741480238a6a6ce612cf089245dd46d6890fba9f blob 8858569 8080252 101294029
744226651c55b14c1aa8affb78fba4fdf02b577c blob 7412220 6766404 186825167
This is where I'm stuck. I can git show these blobs and see that they look very much like jar files, but I can't figure out why they're still in the repo.
Various attempts to find their filenames failed.
git repack -a, git repack -ad, and git repack -Ad all seem to have no effect.
If the large file was added in the most recent commit, you can just run git rm --cached <filename> to remove the large file, then git commit --amend -C HEAD to amend the commit.
If you commit sensitive data, such as a password or SSH key into a Git repository, you can remove it from the history. To entirely remove unwanted files from a repository's history you can use either the git filter-repo tool or the BFG Repo-Cleaner open source tool.
You can do so with ls -lh and ls -lh some_pattern. Add the files or file patterns you want to avoid version controlling (the large files) to your .gitignore file. Double check that your pattern worked by confirming that these files do not show up as untracked when you run git status.
--prune=now on git gc
Although you'd successfully written your unwanted objects out of history, it looks like those unwanted objects were not being pruned because they were too young to be pruned by default (see the configuration docs on git gc for a bit more detail). Using git gc --prune=now should handle that, or you could see this answer for a more nuclear option.
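As a rough sketch of what has to happen before git gc --prune=now can actually delete the rewritten-away objects: nothing may reference them any more, which in practice means dropping the refs/original/ backups that git filter-branch leaves behind and expiring the reflog. The helper name purge_unreachable is hypothetical, and the steps are destructive, so try them on a copy of the repository first.

```shell
#!/bin/sh
# purge_unreachable: drop everything that keeps old objects alive, then prune.
# Hypothetical helper name; destructive, so run it on a backup copy first.
purge_unreachable() {
    # Delete the backup refs that git filter-branch stores under refs/original/.
    for ref in $(git for-each-ref --format='%(refname)' refs/original/); do
        git update-ref -d "$ref" || return 1
    done
    # Expire every reflog entry so old commits are no longer reachable.
    git reflog expire --expire=now --all || return 1
    # Prune unreachable objects immediately instead of waiting for the
    # default grace period.
    git gc --quiet --prune=now
}
```

After this runs, git cat-file -e <sha> should fail for any blob that was only referenced by the old history.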
Although that should fix your final problem, an underlying problem was the difficulty of finding big blobs in order to remove them using git filter-branch - to which I would say:
git filter-branch is painful to use for a task like this, and there's a much better, less well-known tool called The BFG, specifically designed for removing Large Files from Git repos.
The core command to remove big files looks just like this:
$ bfg --strip-blobs-bigger-than 10MB my-repo.git
Any blob over 10MB in size (that isn't in your latest commit) will be totally removed from your repository's history - you don't have to manually find the files yourself, and files in protected commits are safe.
You can then use git gc to clean away the dead data:
$ git gc --prune=now --aggressive
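To check whether a cleanup like this actually shrank the repository, the packed size can be read before and after with git count-objects -v, which reports size-pack in KiB. This is only a small sketch; pack_size_kib is a hypothetical helper name, and it ignores loose objects (reported separately on the size: line).

```shell
#!/bin/sh
# pack_size_kib: print the disk space used by pack files, in KiB.
# Hypothetical helper name; reads the "size-pack:" line of
# `git count-objects -v`.
pack_size_kib() {
    git count-objects -v | awk '/^size-pack:/ { print $2 }'
}
```

For example, store before=$(pack_size_kib), run the git gc command above, and compare against a second call to pack_size_kib.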
The BFG is typically hundreds of times faster than running git-filter-branch on a big repo, and its options are tailored around these two common use-cases: removing crazy big files, and removing passwords, credentials and other private data.
Full disclosure: I'm the author of the BFG Repo-Cleaner.
Have you tried running git gc? http://www.kernel.org/pub/software/scm/git/docs/git-gc.html
You need to run David Underhill's script on each branch in the repository to ensure the references are removed from all branches.
Then, as in the further discussion, initialize a new repository with git init and either git pull from the original or git remote add origin <original> and then pull all branches.
$ du -sh ./BIG
299M ./BIG
$ cd BIG
$ git checkout master
$ git-remove-history REMOVE_ME
....
$ git checkout branch2
$ git-remove-history REMOVE_ME
...
$ mkdir ../SMALL && cd ../SMALL
$ git init
$ git remote add origin ../BIG
$ git fetch --all
$ git checkout master
$ cd ..
$ du -sh ./SMALL ./BIG
26M ./SMALL
244M ./BIG
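With more than a couple of branches, the per-branch step in the session above can be looped instead of typed out by hand. The following is a minimal sketch under the assumption that the history-removal script (git-remove-history in the session above) can be run once per checked-out branch; for_each_branch is a hypothetical helper name.

```shell
#!/bin/sh
# for_each_branch CMD [ARGS...]: check out every local branch in turn and
# run the given command on it. Hypothetical helper name. Branch names cannot
# contain whitespace or glob characters, so the unquoted word splitting in
# the for loop is safe.
for_each_branch() {
    for branch in $(git for-each-ref --format='%(refname:short)' refs/heads/); do
        git checkout -q "$branch" || return 1
        "$@" || return 1
    done
}
```

Inside ./BIG this would be called as for_each_branch git-remove-history REMOVE_ME, and the loop stops at the first branch where the checkout or the command fails.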