
Git repo still huge after large files removed from repository history

I have a codebase that (until now) used git to store its dependencies. The repository itself is available here (warning: it's HUGE). Needless to say, I need to remove the dependencies from the repository history in order to cut it down to a reasonable size.

I started by using David Underhill's instructions to remove the lib directory from the history. Even after doing this, however, the repository is still over 300M. Issuing git prune and git repack helps, but it's still over 180M.

In an attempt to find any bloated blobs, I issued

git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head

with these results:

105526b5d3d398b9989d88c2f9fc2d1dc96a85b8 blob 35685609 33600527 31978828
d296935e6ac5f3f58b50c789394c9769116e9c34 blob 35658016 33593241 112485744
50636f931180a32764edadd854968a971a083f8a blob 28360290 25897864 233390
b9e4dd37428e879a258f297b7f5bcfb9ba869695 blob 13108002 11640713 66661788
08d2720b2414aa07ce419b17d5f80c333c7313b7 blob 12551621 11124009 89231035
6197a478a461275a0396f20c28487e9ae619a5f9 blob 11975135 11058259 148211988 1 50636f931180a32764edadd854968a971a083f8a
549eb0c73776fd0ede27a2fcb03366f76f45a13c blob 9136086 8166649 166451273
5bc0a0f04a7004bc16cfab1c091c6b369fb74049 blob 9072616 8270262 80951514
741480238a6a6ce612cf089245dd46d6890fba9f blob 8858569 8080252 101294029
744226651c55b14c1aa8affb78fba4fdf02b577c blob 7412220 6766404 186825167

This is where I'm stuck. I can git show these blobs and see that they look very much like jar files, but I can't figure out why they're still in the repo.

Various attempts to find their filenames failed.
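One way to map large blobs back to file names is to pair git rev-list --objects with git cat-file --batch-check. The sketch below is illustrative only: the largest_blobs helper and the throwaway repository with lib/big.jar are made-up names, not part of the original repository. It lists only blobs reachable from some ref, so a hash that verify-pack reports but this listing omits is probably no longer referenced by any branch or tag and is just waiting to be pruned.

```shell
#!/bin/sh
# largest_blobs <repo> [count]: hypothetical helper that prints
# "blob <sha> <size-in-bytes> <path>" for the largest reachable blobs.
largest_blobs() {
    git -C "$1" rev-list --objects --all |
        git -C "$1" cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        sort -k3,3nr |
        head -n "${2:-10}"
}

# Throwaway demo repository; file names are invented for illustration.
demo=$(mktemp -d)
git init -q "$demo"
mkdir "$demo/lib"
dd if=/dev/zero of="$demo/lib/big.jar" bs=1000 count=200 2>/dev/null
echo 'hello' > "$demo/README"
git -C "$demo" add .
git -C "$demo" -c user.name=demo -c user.email=demo@example.com commit -q -m 'add files'

# Show the single largest blob together with its path.
top=$(largest_blobs "$demo" 1)
echo "$top"
```

Against a real repository, calling largest_blobs . 10 from the top of the working tree gives a named counterpart to the verify-pack listing.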

git repack -a, git repack -ad, and git repack -Ad all seem to have no effect.

Aaron Novstrup asked Jul 30 '11 16:07




3 Answers

Use --prune=now on git gc

Although you'd successfully written your unwanted objects out of history, it looks like those unwanted objects were not being pruned because they were too young to be pruned by default: git gc only prunes unreachable objects older than two weeks unless told otherwise (see the configuration docs on git gc for a bit more detail). Using git gc --prune=now should handle that, or you could see this answer for a more nuclear option.
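The grace period can be seen in a small experiment. The sketch below is an illustration only, using a throwaway repository and a blob written with git hash-object -w as a stand-in for a jar that has been rewritten out of history; none of these names come from the original repository. It also expires the reflogs first, on the assumption that old reflog entries may still be holding rewritten commits. If filter-branch left backup refs under refs/original, those would need deleting as well before the objects become unreachable.

```shell
#!/bin/sh
# Throwaway repository with one ordinary commit.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
echo 'kept' > README
git add README
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'initial'

# Write a blob that no commit references (stand-in for a removed jar).
dd if=/dev/urandom of=orphan.bin bs=1000 count=50 2>/dev/null
blob=$(git hash-object -w orphan.bin)
rm orphan.bin

# A plain gc keeps recent unreachable objects around.
git gc -q
if git cat-file -e "$blob" 2>/dev/null; then after_default=present; else after_default=pruned; fi

# Drop reflog entries, then prune without any grace period.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after_now=present; else after_now=pruned; fi

echo "after plain gc:       $after_default"
echo "after gc --prune=now: $after_now"
```

With default settings the first check should report the blob as present and the second as pruned.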

Although that should fix your final problem, an underlying problem was the difficulty of finding big blobs in order to remove them using git filter-branch - to which I would say:

...don't use git filter-branch

git filter-branch is painful to use for a task like this, and there's a much better, less well-known tool called The BFG, specifically designed for removing Large Files from Git repos.

The core command to remove big files looks just like this:

$ bfg --strip-blobs-bigger-than 10M my-repo.git

Any blob over 10MB in size (that isn't in your latest commit) will be totally removed from your repository's history - you don't have to manually find the files yourself, and files in protected commits are safe.

You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive
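Put together, the whole clean-up might look like the sketch below. It is a rough outline under several assumptions: bfg_cleanup is a hypothetical wrapper, cleanup.git is an arbitrary directory name, and a bfg command is assumed to be on PATH (the tool is commonly distributed as a jar and run with java -jar instead). The clone is made with --mirror so that every ref is rewritten, and the reflogs are expired before gc so that old commits do not keep the blobs alive.

```shell
#!/bin/sh
# bfg_cleanup <repo-url> [size-limit]: hypothetical wrapper around the steps above.
bfg_cleanup() {
    url=$1
    limit=${2:-10M}
    if [ -z "$url" ]; then
        echo "usage: bfg_cleanup <repo-url> [size-limit]" >&2
        return 2
    fi
    # Assumption: BFG is installed as an executable named "bfg".
    if ! command -v bfg >/dev/null 2>&1; then
        echo "bfg not found on PATH" >&2
        return 127
    fi
    # Bare mirror clone, so all branches and tags are covered.
    git clone --mirror "$url" cleanup.git || return 1
    # Strip every blob above the limit that is not in the latest commit.
    bfg --strip-blobs-bigger-than "$limit" cleanup.git || return 1
    # Expire reflogs, then physically remove the now-unreachable objects.
    git -C cleanup.git reflog expire --expire=now --all &&
        git -C cleanup.git gc --prune=now --aggressive
}
```

Rewriting history changes commit IDs, so anyone with an existing clone would need to re-clone afterwards rather than pull.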

The BFG is typically hundreds of times faster than running git filter-branch on a big repo and the options are tailored around these two common use-cases:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Roberto Tyley answered Oct 20 '22 22:10


Have you tried running git gc? http://www.kernel.org/pub/software/scm/git/docs/git-gc.html

Clueless answered Oct 20 '22 20:10


You need to run David Underhill's script on each branch in the repository to ensure the references are removed from all branches.

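Then, as suggested in the further discussion, initialize a new repository with git init and either git pull from the original, or git remote add origin <original> and then fetch all branches. The new repository receives only the objects reachable from the fetched branches, so the removed files are left behind in the old one.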

$ du -sh ./BIG
299M ./BIG
$ cd BIG
$ git checkout master
$ git-remove-history REMOVE_ME
....
$ git checkout branch2
$ git-remove-history REMOVE_ME
...
$ mkdir ../SMALL
$ cd ../SMALL
$ git init
$ git remote add origin ../BIG
$ git fetch --all
$ git checkout master
$ cd ..
$ du -sh ./SMALL ./BIG
26M ./SMALL
244M ./BIG
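
As an alternative to checking out each branch in turn, git filter-branch can rewrite every ref in one pass when given -- --all. The sketch below is not David Underhill's script. It is a minimal stand-in run against a throwaway repository, with lib, dep.jar, main.txt and branch2 invented for the demonstration. FILTER_BRANCH_SQUELCH_WARNING only silences the notice that newer Git versions print.

```shell
#!/bin/sh
# Throwaway repository with two branches that both contain lib/.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
git config user.name demo
git config user.email demo@example.com

mkdir lib
echo 'jar bytes' > lib/dep.jar
echo 'source' > main.txt
git add .
git commit -q -m 'initial with lib'
git branch branch2
echo 'more' >> main.txt
git commit -q -am 'change on first branch'

# Remove lib/ from the index of every commit on every ref.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
    --index-filter 'git rm -r -q --cached --ignore-unmatch lib' \
    --prune-empty -- --all >/dev/null 2>&1

# Count commits on any branch that still touch lib/.
remaining=$(git rev-list --count --branches -- lib)
echo "commits touching lib on any branch: $remaining"
```

The backup refs that filter-branch writes under refs/original still point at the old commits, which is one reason fetching into a fresh repository, as above, ends up so much smaller.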

stephenhouser answered Oct 20 '22 22:10