After a few months of committing and pushing to my project, the repository on Bitbucket has grown steadily to about 1 GB. I tried to remove some database folders that did not need to be in the repository. After searching, I found that most suggestions propose:
git filter-branch -f --tree-filter 'rm -rf folder/subfolder' HEAD
After removing a few folders, I pushed the change to the repository with --force:
git push origin master --force
I then found that the repository gets larger every time I use those commands! It has now grown to 2.5 GB!
Any suggestions, please?
EDIT
Following the suggestion below, I tried these commands:
(for all large files)
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" --tag-name-filter cat -- --all
(remove the temporary history git-filter-branch otherwise leaves behind for a long time)
rm -rf .git/refs/original/
git reflog expire --all
git gc --aggressive --prune
But the .git/objects folder is still very large!
OK, given your answer to my comment, we can now say what happened.
What git filter-branch does is to copy (some or all of) your commits to new ones, then update the references. This means your repository gets bigger (not smaller), at least initially.
The commits that are copied are those reachable via the references given. In this case, the reference you gave is HEAD (which git turns into "your current branch": probably master, but really whatever your current branch was at the time of the filter-branch command). If (and only if) the new copy is precisely, bit-for-bit identical to the original, then it actually is the original and there is no actual copy made (the original is reused instead). However, as soon as you make any change, such as removing folder/subfolder, from that point on these really are copies.
The copied stuff is, in this case, smaller, because you've removed some items. (It's generally not very much smaller since git compresses items pretty well.) But you're still adding more stuff to the repository: new commits, which refer to new trees, which, fortunately, refer to the same old blobs (file objects) as before, just slightly fewer of them this time (the objects for the folder/subfolder files are still in the repository, but the copied commits and tree-objects no longer refer to them).
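To see this growth on a small scale, here is a minimal sketch that builds a throwaway repository and counts the stored objects before and after a filter-branch run on HEAD. The file names (db.bin, main.txt), the demo identity, and the count_objects helper are assumptions made up for illustration; it uses the faster --index-filter form rather than --tree-filter.

```shell
#!/bin/sh
# Throwaway demo repository; every path and name below is hypothetical.
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning pause in newer git
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master   # force the branch name "master"

mkdir -p folder/subfolder
echo "pretend this is a large database" > folder/subfolder/db.bin
echo "code" > main.txt
git add .
git commit -q -m "root commit"
echo "more code" >> main.txt
git commit -q -a -m "second commit"

# Count every object in the object store, reachable or not.
count_objects() {
    git cat-file --batch-all-objects --batch-check | wc -l | tr -d ' '
}

blob=$(git rev-parse HEAD:folder/subfolder/db.bin)
before=$(count_objects)

git filter-branch -f --index-filter \
    'git rm -r -q --cached --ignore-unmatch folder/subfolder' HEAD

after=$(count_objects)
echo "objects before: $before, after: $after"
git cat-file -e "$blob" && echo "the removed file's blob is still stored"
```

The count goes up because the new commits and trees are added alongside the old ones, and the blob for the removed file is still present.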
Pictorially, at this point in the filter-branch process, we now have both the old commits:
R--o--o---o--o   <-- master
    \    /
     o--o        <-- feature
and the new ones (I'll assume folder/subfolder appeared in the original root commit R so that we have a copy R' here):
R'-o'-o'--o'-o'
    \    /
     o'-o'
What filter-branch does now, at the end of the copying process, is re-point some references (branch and tag names, mainly). The ones it re-points are the ones you tell it to, by mentioning them as what the documentation calls "positive references". In this case, if you were on master (i.e., HEAD was another name for master), the single positive reference you gave is master, so that's all filter-branch re-points. It also makes backup references whose names start with refs/original/. This means you now have the following commits:
R--o--o---o--o   <-- refs/original/refs/heads/master
    \    /
     o--o        <-- feature
R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'
Note that feature still points to all the old (not-copied) commits, so that even if / after you get rid of any refs/original/ references, git will retain all the still-referenced commits across any garbage-collect activity, giving:
R--o
    \
     o--o        <-- feature
R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'
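The re-pointing described above can be observed directly. This is a small sketch under assumed names (a two-branch throwaway repository with hypothetical files), not a reproduction of the asker's repository: after filtering only HEAD, master moves, a backup appears under refs/original/, and feature still holds the old commits.

```shell
#!/bin/sh
# Throwaway demo repository; branch contents and file names are hypothetical.
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master

mkdir -p folder/subfolder
echo "data" > folder/subfolder/db.bin
echo "code" > main.txt
git add .
git commit -q -m "R"

git checkout -q -b feature
echo "feature work" > feature.txt
git add feature.txt
git commit -q -m "o on feature"

git checkout -q master
echo "more code" >> main.txt
git commit -q -a -m "o on master"

old_master=$(git rev-parse master)
old_feature=$(git rev-parse feature)

# Only HEAD (that is, master) is named, so only master is re-pointed.
git filter-branch -f --index-filter \
    'git rm -r -q --cached --ignore-unmatch folder/subfolder' HEAD

# List every reference and the commit it now points to.
git for-each-ref --format='%(objectname:short) %(refname)'

# feature still reaches commits that contain folder/subfolder.
still_there=$(git rev-list feature -- folder/subfolder)
echo "old commits kept alive by feature: $still_there"
```

Deleting refs/original/ at this point would not free the old root commit, because feature still refers to it.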
To get filter-branch to update all the references, you need to name them all. An easy way to do that is to use --all, which quite literally names all references. In this case, the initial "after" picture looks like this instead:
R--o--o---o--o   <-- refs/original/refs/heads/master
    \    /
     o--o        <-- refs/original/refs/heads/feature
R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'       <-- feature
Now if you erase all the refs/original/ references, all the old commits become unreferenced and can get garbage-collected. Well, that is, they do unless there are tags pointing to them.
For tag references, filter-branch only updates them in any way if you supply a --tag-name-filter. Usually you want --tag-name-filter cat, which keeps the tag names unchanged, but makes filter-branch point them to the newly copied commits. That way you don't hang on to the old commits: the whole point of the exercise is to make everything use the new copies, and throw away the old copies, so that the big-file objects can be garbage-collected.
Putting this all together, instead of:
git filter-branch -f --tree-filter 'rm -rf folder/subfolder'
you can use:
git filter-branch -f --tree-filter 'rm -rf folder/subfolder' \
    --tag-name-filter cat -- --all
(You don't need the backslash-newline sequence; I put that in just to make the line fit better on Stack Overflow. Note that --tree-filter is very slow: for this particular case it is much faster to use --index-filter. The index filter command here would be git rm --cached --ignore-unmatch -r folder/subfolder.)
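Put together with the index filter, the full invocation might look like the sketch below. It runs against a throwaway repository whose branches, tag (v1), and files are all hypothetical, so that the effect on every reference can be checked safely before trying anything similar on a copy of a real repository.

```shell
#!/bin/sh
# Throwaway demo repository; the tag v1 and all file names are hypothetical.
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master

mkdir -p folder/subfolder
echo "data" > folder/subfolder/db.bin
echo "code" > main.txt
git add .
git commit -q -m "R"
git tag v1                                # lightweight tag on the root commit

git checkout -q -b feature
echo "feature work" > feature.txt
git add feature.txt
git commit -q -m "o on feature"

git checkout -q master
echo "more code" >> main.txt
git commit -q -a -m "o on master"

old_tag=$(git rev-parse v1)

# Index filter instead of tree filter, tags rewritten in place, all refs named.
git filter-branch -f \
    --index-filter 'git rm -r -q --cached --ignore-unmatch folder/subfolder' \
    --tag-name-filter cat -- --all

# Commits reachable from branches or tags that still touch the folder.
leftover=$(git rev-list --branches --tags -- folder/subfolder)
echo "commits still touching folder/subfolder: ${leftover:-none}"
```

Only the backups under refs/original/ should still reach the old commits afterwards.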
Note also that you need to do all this on (a copy of) the original repository (you did keep a backup, right?). (If you did not keep a backup, the refs/original/ references may be your salvation.)
Edit: OK, so you did some filter-branch-ing, and you did something that deleted any refs/original/ references. (In my experiment on a temp repo, running git filter-branch on HEAD used whatever branch I was on as the branch that was re-pointed, and made a refs/original/ copy of the previous value.) There are no backups of the repository. Now what?
Well, as a first step, make a backup now. This way, if things get any worse, you can at least get back to "only slightly bad". To make a backup of the repo, you can simply clone it (or: clone it, then call the original the "backup", then begin working on the clone). For future reference, since git filter-branch can be quite destructive, it's usually wise to start by doing this backing-up process. (Also, I'll note that a clone on Bitbucket, when not yet pushed to, would serve. Unfortunately you did a push. Perhaps Bitbucket can retrieve an earlier version of the repository from some backups or snapshots of their own.)
Next, let's note a peculiarity of commits and their SHA-1 "true names", that I mentioned earlier. The SHA-1 name of a commit is a cryptographic checksum of its contents. Let's take a look at a sample commit in git's own source tree (trimmed down a bit just for length, and email addresses whacked to foil harvesters):
$ git cat-file -p 5de7f500c13c8158696a68d86da1030313ddaf69
tree 73eee5d136d2b00c623c3fceceffab85c9e9b47e
parent c4ad00f8ccb59a0ae0735e8e32b203d4bd835616
author Jeff King <peff peff.net> 1405233728 -0400
committer Junio C Hamano <gitster pobox.com> 1406567673 -0700

alloc: factor out commit index

We keep a static counter to set the commit index on newly
allocated objects. However, since we also need to set the
[snip]
Here, we can see that the contents of this commit (whose "true name" is 5de7f50...) start with a tree and another SHA-1, a parent and another SHA-1, an author and committer, then a blank line followed by the commit message text.
If you look at a tree you'll see that it contains the "true names" (SHA-1 values) of sub-trees (sub-directories) and file objects ("blobs", in git terminology) along with their modes (really, just whether the blob should have execute permission set, or not) and their names within the directory. For instance, the first line of the above tree is:
100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f .gitattributes
which means that the repository object 5e98806... should be extracted, put in a file named .gitattributes, and set non-executable.
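To look at a tree yourself, git cat-file -p works on trees as well as commits. The sketch below uses a throwaway repository with made-up files (run.sh, docs/README) to show the three kinds of entry: a plain file, an executable file, and a sub-tree.

```shell
#!/bin/sh
# Throwaway demo repository; all file names are hypothetical.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

echo "*.txt text" > .gitattributes
printf '#!/bin/sh\necho hi\n' > run.sh
mkdir docs
echo "readme" > docs/README
git add .
git update-index --chmod=+x run.sh   # record the execute bit in the index
git commit -q -m "demo commit"

# Each line is: mode, object type, SHA-1, then the name within the directory.
tree_listing=$(git cat-file -p 'HEAD^{tree}')
echo "$tree_listing"
```

Mode 100644 marks a non-executable file, 100755 an executable one, and 040000 a sub-tree.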
If I ask git to make a new commit, and set up, as its contents, the same tree (73eee5d...), the same parent (c4ad00f...), and the same author, committer, and message text, then when I get git to write that commit to the repository, it will generate the same "true name" 5de7f50.... In other words, it literally is the same commit: it's already in the repository and git commit-tree will just give me back the existing ID. While it's a bit tricky to set all this up, that's exactly what git filter-branch ends up doing: it extracts the original commit, applies your filters, sets up everything, and then does a git commit-tree.
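The "same contents, same name" property can be checked by hand with git commit-tree. This is a sketch in a throwaway repository; the identity, the fixed dates, and the file name are assumptions chosen so that the re-created commit can match exactly. The author and committer values come from the exported GIT_* variables, which are identical to those used for the original commit.

```shell
#!/bin/sh
# Throwaway demo repository; identity, dates, and file names are hypothetical.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Fixed dates in git's internal "<unix time> <timezone>" format.
export GIT_AUTHOR_DATE='1405233728 -0400'
export GIT_COMMITTER_DATE='1406567673 -0700'

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

echo "one" > file.txt
git add file.txt
git commit -q -m "first"
echo "two" >> file.txt
git commit -q -a -m "second: subject

body text"

orig=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')
parent=$(git rev-parse 'HEAD^')

# Same tree, parent, author, committer, and message: the ID is the same.
# (sed drops the header lines up to the first blank line, leaving the message.)
same=$(git cat-file commit HEAD | sed '1,/^$/d' |
       git commit-tree "$tree" -p "$parent")

# Change only the message: the ID differs.
changed=$(echo "a different message" | git commit-tree "$tree" -p "$parent")

echo "original: $orig"
echo "rebuilt:  $same"
echo "changed:  $changed"
```

The rebuilt ID matches the original because every byte of the commit object is identical; altering any one field yields a new object.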
On your original repo, you ran a git filter-branch command that copied commits to new, modified commits (with different trees and hence, at some point, different true names which led to different parent IDs in subsequent commits, and so on). However, if you copy those copied commits by applying a filter that this time does nothing, then the new tree objects will be the same as the old ones. If the new parent is the same, and the author, committer, and message also all remain the same, the new commit-ID for the copy will be the same as the old ID.
That is, these new copies are not copies after all, they're just the originals again!
Any other commits—those that were not copied in the first pass—do get copied, and hence have different IDs.
Here's where things get tricky.
If your current repository looks like this (graphically speaking):
R--o--o---o--o   <-- xxx [needs a name so that filter-branch will process it]
    \    /
     o--o        <-- feature
R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'
and we apply a new filter-branch to all references (or even "all but master") in such a way that it generates the same trees this time, it will copy R again and the new tree will match that for R', so the copy will actually be R'. Then it will copy the first post-R node, make the same changes, and the copy will actually be the first post-R' node, o'. This will repeat for all nodes, possibly even including R' and all the o's. If filter-branch copies R', the resulting copy will just be R' again, though, because "remove nonexistent directory" makes no change: our filter does nothing to these particular commits.
Finally, filter-branch will move the labels, leaving the refs/original/ versions behind:
R--o--o---o--o   <-- refs/original/refs/heads/xxx
    \    /
     o--o        <-- refs/original/refs/heads/feature
R'-o'-o'--o'-o'  <-- master, xxx
    \    /
     o'-o'       <-- feature
This is, in fact, the desired outcome.
What if the repository looks more like this? That is, what if there is no xxx or similar label pointing to the original (pre-filtering) master, so that you have this:
R--o
    \
     o--o        <-- feature
R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'
The filter-branch script will still copy R and the result will still be R'. Then it will copy the first o node and the result will still be the first o' node, and so on. It won't copy the now-deleted nodes, but it won't have to: we already have those, reachable via the branch-name master. As before, filter-branch may copy R' and the various o' nodes, but this is OK, as the filter will do nothing so that the copies are really just the originals after all.
Last, filter-branch will, as usual, update the references:
R--o
    \
     o--o        <-- refs/original/refs/heads/feature
R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'       <-- feature
The key that makes this all work is that the filter leaves already-modified commits untouched, so that their second "copies" are just the first-copies again. [1]
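This recovery scenario can be rehearsed on a throwaway repository. In the sketch below (branch layout and file names are hypothetical), the first pass filters only master, the backups are then deleted to mimic the situation described, and a second pass over all references rewrites feature onto the very same root commit that master already uses.

```shell
#!/bin/sh
# Throwaway demo repository; branch contents and file names are hypothetical.
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master

mkdir -p folder/subfolder
echo "data" > folder/subfolder/db.bin
echo "code" > main.txt
git add .
git commit -q -m "R"

git checkout -q -b feature
echo "feature work" > feature.txt
git add feature.txt
git commit -q -m "o on feature"

git checkout -q master
echo "more code" >> main.txt
git commit -q -a -m "o on master"

filter='git rm -r -q --cached --ignore-unmatch folder/subfolder'

# First pass: only the current branch (master) is rewritten.
git filter-branch -f --index-filter "$filter" HEAD

# Mimic the lost backups by deleting refs/original/.
git for-each-ref --format='%(refname)' refs/original/ |
    xargs -n 1 git update-ref -d
master_after_first=$(git rev-parse master)

# Second pass: every reference. master's commits come out unchanged;
# feature's copy of R turns out to be the existing R'.
git filter-branch -f --index-filter "$filter" --tag-name-filter cat -- --all

root_master=$(git rev-list --max-parents=0 master)
root_feature=$(git rev-list --max-parents=0 feature)
echo "master root:  $root_master"
echo "feature root: $root_feature"
```

filter-branch warns that master is unchanged during the second pass, which is expected: its commits were reproduced bit for bit.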
Once everything is done, you can do the same shrinking described in the git filter-branch documentation to ditch the refs/original/ names and garbage-collect the now-unreferenced objects.
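For reference, the shrinking steps might look like the following sketch, run here on a throwaway repository with hypothetical file names so the effect can be verified: the blob for the removed file is present before the cleanup and gone afterwards. These commands permanently discard the old history, so on a real repository they belong only after a backup exists.

```shell
#!/bin/sh
# Throwaway demo repository; all file names are hypothetical.
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master

mkdir -p folder/subfolder
echo "pretend this is a large database" > folder/subfolder/db.bin
echo "code" > main.txt
git add .
git commit -q -m "R"
echo "more code" >> main.txt
git commit -q -a -m "o on master"

blob=$(git rev-parse HEAD:folder/subfolder/db.bin)

git filter-branch -f \
    --index-filter 'git rm -r -q --cached --ignore-unmatch folder/subfolder' \
    --tag-name-filter cat -- --all

# 1. Drop the backup references that filter-branch left behind.
git for-each-ref --format='%(refname)' refs/original/ |
    xargs -n 1 git update-ref -d
# 2. Expire every reflog entry so the old commits are no longer remembered.
git reflog expire --expire=now --all
# 3. Repack and prune unreferenced objects immediately.
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then
    echo "blob is still stored"
else
    echo "blob has been pruned"
fi
```

A hosted remote keeps its own copy of the objects, so a force push alone may not shrink the size reported there until the server runs its own garbage collection.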
[1] If you had been using a filter that is not as easily repeated (e.g., one that makes new commits with "the current time" as their time-stamps), you would really need an untouched original repository, or those refs/original/ references (either one would suffice to keep an "original copy" around).