
Reduce size of git repository on Bitbucket

Tags:

git

bitbucket

After a few months of committing and pushing to my project, the repository on Bitbucket has gradually grown to about 1 GB. I tried to remove some database folders that do not need to be tracked. After searching, I found that most suggestions propose:

git filter-branch -f --tree-filter 'rm -rf folder/subfolder' HEAD

After removing a few folders, I pushed the change to the repository with --force:

git push origin master --force

I finally found that the repository gets larger every time I use those commands. It has now grown to 2.5 GB!

Any suggestions, please?

EDIT: Based on the suggestion below, I tried the following commands
(for all large files)

git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" --tag-name-filter cat -- --all

(remove the temporary history git-filter-branch otherwise leaves behind for a long time)

rm -rf .git/refs/original/

git reflog expire --all
git gc --aggressive --prune

But the .git/objects folder is still large!

Y.AL asked Aug 20 '14 08:08



1 Answer

OK, given your answer to my comment, we can now say what happened.

What git filter-branch does is to copy (some or all of) your commits to new ones, then update the references. This means your repository gets bigger (not smaller), at least initially.

The commits that are copied are those reachable via the references given. In this case, the reference you gave is HEAD (which git turns into "your current branch", probably master, but really whatever your current branch was at the time of the filter-branch command). If (and only if) the new copy is precisely, bit-for-bit identical to the original, then it actually is the original and no actual copy is made (the original is reused instead). However, as soon as you make any change, such as removing folder/subfolder, from that point on these really are copies.

The copied stuff is, in this case, smaller, because you've removed some items. (It's generally not very much smaller since git compresses items pretty well.) But you're still adding more stuff to the repository: new commits, which refer to new trees, which—fortunately—refer to the same old blobs (file objects) as before, just slightly fewer of them this time (the objects for the folder/subfolder files are still in the repository, but the copied commits and tree-objects no longer refer to them).
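The growth described above can be reproduced in a throwaway repository. This is only a sketch: the temporary directory, the file names, and the demo identity are made up for illustration, and FILTER_BRANCH_SQUELCH_WARNING is set merely to skip the start-up warning that newer Git versions print.

```shell
#!/bin/sh
# Throwaway demo; every path and name below is hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning/delay in newer Git
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p folder/subfolder
echo "pretend this is a large dump" > folder/subfolder/dump.sql
echo one > a.txt
git add .
git commit -q -m "first"
echo two > a.txt
git commit -q -a -m "second"

# Remember the blob for the unwanted file and count loose objects.
blob=$(git rev-parse HEAD:folder/subfolder/dump.sql)
before=$(git count-objects | cut -d' ' -f1)

git filter-branch -f --tree-filter 'rm -rf folder/subfolder' HEAD >/dev/null 2>&1

after=$(git count-objects | cut -d' ' -f1)
echo "loose objects before: $before, after: $after"

# The old blob is still stored; only the new commits stopped referring to it.
if git cat-file -e "$blob"; then blob_present=yes; else blob_present=no; fi
echo "old blob still present: $blob_present"
```

The object count goes up because the new commits and trees are added next to the old ones, which is the same effect that made the pushed repository larger.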

Pictorially, at this point in the filter-branch process, we now have both the old commits:

R--o--o---o--o   <-- master
    \    /
     o--o        <-- feature

and the new ones (I'll assume folder/subfolder appeared in the original root commit R so that we have a copy R' here):

R'-o'-o'--o'-o'
    \    /
     o'-o'

What filter-branch does now, at the end of the copying process, is re-point some references (branch and tag names, mainly). The ones it re-points are the ones you tell it to, by mentioning them as what the documentation calls "positive references". In this case, if you were on master (i.e., HEAD was another name for master), the single positive reference you gave is master ... so that's all filter-branch re-points. It also makes backup references whose name starts with refs/original/. This means you now have the following commits:

R--o--o---o--o   <-- refs/original/refs/heads/master
    \    /
     o--o        <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'

Note that feature still points to the old (not-copied) commits, so that even after you get rid of any refs/original/ references, git will retain all the still-referenced commits across any garbage-collect activity, giving:

R--o
    \
     o--o        <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'
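One way to see which names still keep the old commits alive is to ask git which references can reach the original root commit. The sketch below builds a small hypothetical repository (all names are invented), filters only HEAD as in the question, and then lists the references that still contain the old root.

```shell
#!/bin/sh
# Throwaway demo; every path and name below is hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning/delay in newer Git
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p folder/subfolder
echo "pretend this is a large dump" > folder/subfolder/dump.sql
git add .
git commit -q -m "root"
main=$(git symbolic-ref --short HEAD)    # master or main, depending on config
git branch feature                       # feature stays on the old root commit
echo more > b.txt
git add b.txt
git commit -q -m "more"
old_root=$(git rev-list --max-parents=0 HEAD)

# Rewrite only the current branch, as in the question.
git filter-branch -f --tree-filter 'rm -rf folder/subfolder' HEAD >/dev/null 2>&1

# Every ref listed here still reaches the original (unfiltered) root commit.
holders=$(git for-each-ref --format='%(refname)' --contains "$old_root")
echo "$holders"
```

In this setup the list should contain refs/heads/feature and the refs/original/ backup, but not the rewritten current branch.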

To get filter-branch to update all the references, you need to name them all. An easy way to do that is to use --all, which quite literally names all references. In this case, the initial "after" picture looks like this instead:

R--o--o---o--o   <-- refs/original/refs/heads/master
    \    /
     o--o        <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'       <-- feature

Now if you erase all the refs/original/ references, all the old commits become unreferenced and can get garbage-collected. Well, that is, they do unless there are tags pointing to them.

For tag references, filter-branch only updates them in any way if you supply a --tag-name-filter. Usually you want --tag-name-filter cat, which keeps the tag names unchanged, but makes filter-branch point them to the newly copied commits. That way you don't hang on to the old commits: the whole point of the exercise is to make everything use the new copies, and throw away the old copies, so that the big-file objects can be garbage-collected.


Putting this all together, instead of:

git filter-branch -f --tree-filter 'rm -rf folder/subfolder'

you can use:

git filter-branch -f --tree-filter 'rm -rf folder/subfolder' \
    --tag-name-filter cat -- --all

(You don't need the backslash-newline sequence; I put that in just to make the line fit better on Stack Overflow. Note that --tree-filter is very slow: for this particular case it is much faster to use --index-filter. The index filter command here would be git rm --cached --ignore-unmatch -r folder/subfolder.)
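Spelled out, the --index-filter variant might look like the following sketch. The repository, tag, and branch names are hypothetical and exist only so the command has something to rewrite; the filter-branch invocation itself just combines the options already mentioned above.

```shell
#!/bin/sh
# Throwaway demo; every path and name below is hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning/delay in newer Git
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p folder/subfolder
echo "pretend this is a large dump" > folder/subfolder/dump.sql
echo one > a.txt
git add .
git commit -q -m "root"
git tag v1
git branch feature
echo two > a.txt
git commit -q -a -m "second"

# Rewrite every branch and tag, working on the index only (no checkout).
git filter-branch -f \
    --index-filter 'git rm --cached --ignore-unmatch -r folder/subfolder' \
    --tag-name-filter cat -- --all >/dev/null 2>&1

# No commit reachable from a branch or tag should touch the folder any more.
remaining=$(git log --branches --tags --format=%H -- folder/subfolder)
echo "commits still touching folder/subfolder: ${remaining:-none}"
```

The old commits are still reachable through refs/original/ at this point, so the repository has not shrunk yet.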

Note also that you need to do all this on (a copy of) the original repository (you did keep a backup, right?). (If you did not keep a backup, the refs/original/ references may be your salvation.)


Edit: OK, so you did some filter-branch-ing, and you did something that deleted any refs/original/ references. (In my experiment on a temp repo, running git filter-branch on HEAD used whatever branch I was on as the branch that was re-pointed, and made a refs/original/ copy of the previous value.) There are no backups of the repository. Now what?

Well, as a first step, make a backup now. This way, if things get any worse, you can at least get back to "only slightly bad". To make a backup of the repo, you can simply clone it (or: clone it, then call the original the "backup", then begin working on the clone). For future reference, since git filter-branch can be quite destructive, it's usually wise to start by doing this backing-up process. (Also, I'll note that a clone on bitbucket, when not yet pushed-to, would serve. Unfortunately you did a push. Perhaps bitbucket can retrieve an earlier version of the repository from some backups or snapshots of their own.)
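As a minimal sketch of that backup step: the answer only says to clone, and the use of --mirror here is an assumption on my part, chosen because it copies every reference rather than just the branches. The source repository and backup path are hypothetical.

```shell
#!/bin/sh
# Throwaway demo; every path and name below is hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the repository you are about to filter.
src=$(mktemp -d)
git -C "$src" init -q .
echo data > "$src/file.txt"
git -C "$src" add file.txt
git -C "$src" commit -q -m "initial"

# Back it up before running anything destructive.
backup="$(mktemp -d)/backup.git"
git clone -q --mirror "$src" "$backup"

# Both repositories should now list exactly the same references.
src_refs=$(git -C "$src" for-each-ref)
backup_refs=$(git -C "$backup" for-each-ref)
echo "$backup_refs"
```

Copying the whole directory (including .git) with ordinary file tools would serve the same purpose.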

Next, let's note a peculiarity of commits and their SHA-1 "true names", that I mentioned earlier. The SHA-1 name of a commit is a cryptographic checksum of its contents. Let's take a look at a sample commit in git's own source tree (trimmed down a bit just for length, and email addresses whacked to foil harvesters):

$ git cat-file -p 5de7f500c13c8158696a68d86da1030313ddaf69
tree 73eee5d136d2b00c623c3fceceffab85c9e9b47e
parent c4ad00f8ccb59a0ae0735e8e32b203d4bd835616
author Jeff King <peff peff.net> 1405233728 -0400
committer Junio C Hamano <gitster pobox.com> 1406567673 -0700

alloc: factor out commit index

We keep a static counter to set the commit index on newly
allocated objects. However, since we also need to set the
[snip]

Here, we can see that the contents of this commit (whose "true name" is 5de7f50...) start with a tree and another SHA-1, a parent and another SHA-1, an author and committer, then a blank line followed by the commit message text.

If you look at a tree you'll see that it contains the "true names" (SHA-1 values) of sub-trees (sub-directories) and file objects ("blobs", in git terminology) along with their modes—really, just whether the blob should have execute permission set, or not—and their names within the directory. For instance, the first line of the above tree is:

100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f    .gitattributes

which means that the repository object 5e98806... should be extracted, put in a file named .gitattributes, and set non-executable.

If I ask git to make a new commit, and set up, as its contents:

  • the same tree (73eee5d...)
  • the same parent (c4ad00f...)
  • the same author and committer
  • and the same blank line and message

then when I get git to write that commit to the repository, it will generate the same "true name" 5de7f50.... In other words, it literally is the same commit: it's already in the repository and git commit-tree will just give me back the existing ID. While it's a bit tricky to set all this up, that's exactly what git filter-branch ends up doing: it extracts the original commit, applies your filters, sets up everything, and then does a git commit-tree.
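This can be checked directly with git commit-tree. The sketch below uses a hypothetical two-commit repository with a made-up identity and fixed dates; it rebuilds the latest commit from the same tree, parent, author, committer, and message, then repeats the exercise with one field changed.

```shell
#!/bin/sh
# Throwaway demo; every path and name below is hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Fixed dates, so the recreated commit gets identical header lines.
export GIT_AUTHOR_DATE='1405233728 -0400'
export GIT_COMMITTER_DATE='1406567673 -0700'

repo=$(mktemp -d)
cd "$repo"
git init -q .
echo hello > file.txt
git add file.txt
git commit -q -m "root"
echo again > file.txt
git commit -q -a -m "second commit"

orig=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')
parent=$(git rev-parse 'HEAD^')
# Everything after the first blank line of the raw commit is the message.
git cat-file commit HEAD | sed '1,/^$/d' > .git/demo-msg

# Same tree, parent, author, committer and message: the same ID comes back.
copy=$(git commit-tree "$tree" -p "$parent" < .git/demo-msg)

# Change one input (the committer date) and the ID changes.
changed=$(GIT_COMMITTER_DATE='1406567674 -0700' \
    git commit-tree "$tree" -p "$parent" < .git/demo-msg)

echo "original: $orig"
echo "copy:     $copy"
echo "changed:  $changed"
```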

What this means for you

On your original repo, you ran a git filter-branch command that copied commits to new, modified commits (with different trees and hence, at some point, different true names which led to different parent IDs in subsequent commits, and so on). However, if you copy those copied commits by applying a filter that this time does nothing, then the new tree objects will be the same as the old ones. If the new parent is the same, and the author, committer, and message also all remain the same, the new commit-ID for the copy will be the same as the old ID.

That is, these new copies are not copies after all, they're just the originals again!

Any other commits—those that were not copied in the first pass—do get copied, and hence have different IDs.

Here's where things get tricky.

If your current repository looks like this (graphically speaking):

R--o--o---o--o   <-- xxx [needs a name so that filter-branch will process it]
    \    /
     o--o        <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'

and we apply a new filter-branch to all references (or even "all but master") in such a way that it generates the same trees this time, it will copy R again and the new tree will match that for R', so the copy will actually be R'. Then it will copy the first post-R node, make the same changes, and the copy will actually be the first post-R' node, o'. This will repeat for all nodes, possibly even including R' and all the o' nodes. If filter-branch copies R', the resulting copy will just be R' again, though, because "remove nonexistent directory" makes no change: our filter does nothing to these particular commits.

Finally, filter-branch will move the labels, leaving the refs/original/ versions behind:

R--o--o---o--o   <-- refs/original/refs/heads/xxx
    \    /
     o--o        <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'  <-- master, xxx
    \    /
     o'-o'       <-- feature

This is, in fact, the desired outcome.

What if the repository looks more like this? That is, what if there is no xxx or similar label pointing to the original (pre-filtering) master, so that you have this:

R--o
    \
     o--o        <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'

The filter-branch script will still copy R and the result will still be R'. Then it will copy the first o node and the result will still be the first o' node, and so on. It won't copy the now-deleted nodes, but it won't have to: we already have those, reachable via the branch-name master. As before, filter-branch may copy R' and the various o' nodes, but this is OK, as the filter will do nothing so that the copies are really just the originals after all.

Last, filter-branch will, as usual, update the references:

R--o
    \
     o--o        <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'       <-- feature

The key that makes this all work is that the filter leaves already-modified commits untouched, so that their second "copies" are just the first-copies again. [1]

Once everything is done, you can do the same shrinking described in the git filter-branch documentation to ditch the refs/original/ names and garbage-collect the now-unreferenced objects.
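A sketch of that shrinking step, run here against a hypothetical throwaway repository (all names invented) so the effect can be observed. The three commands after the filter-branch call are the part that matters: remove the backup references, expire the reflogs immediately, and prune immediately rather than waiting for the default grace periods.

```shell
#!/bin/sh
# Throwaway demo; every path and name below is hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning/delay in newer Git
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p folder/subfolder
echo "pretend this is a large dump" > folder/subfolder/dump.sql
echo one > a.txt
git add .
git commit -q -m "root"
blob=$(git rev-parse HEAD:folder/subfolder/dump.sql)

git filter-branch -f \
    --index-filter 'git rm --cached --ignore-unmatch -r folder/subfolder' \
    --tag-name-filter cat -- --all >/dev/null 2>&1

# 1. Delete the backup refs that filter-branch left behind.
git for-each-ref --format='%(refname)' refs/original/ |
    xargs -n 1 git update-ref -d
# 2. Expire every reflog entry right away.
git reflog expire --expire=now --all
# 3. Repack and drop all objects that are no longer reachable.
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then blob_present=yes; else blob_present=no; fi
echo "old blob still present: $blob_present"
```

On a real repository, run this only after confirming that every branch and tag has been rewritten and that a backup exists, since the pruned objects cannot be recovered from this copy afterwards.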


[1] If you had been using a filter that is not as easily repeated (e.g., one that makes new commits with "the current time" as their time-stamps), you would really need an untouched original repository, or those refs/original/ references (either one would suffice to keep an "original copy" around).

torek answered Oct 07 '22 09:10
