
Bitbucket warns that my Git repo is too large, but I cannot find any large files

Bitbucket is warning me that my Git repository is over 1 GB. In fact, the Repository details page says it is 1.7 GB. That's crazy. I must have included large data files in version control. My local repository is actually 10 GB, which means I have at least been using .gitignore with some success to exclude big files from version control.

Next, I followed the tutorial at https://confluence.atlassian.com/display/BITBUCKET/Reduce+repository+size and tried to delete unused large data files. The command git count-objects -v, run in the top-level folder of my repo, returned the following:

count: 5149
size: 1339824
in-pack: 11352
packs: 2
size-pack: 183607
prune-packable: 0
garbage: 0
size-garbage: 0

The size-pack of 183607 KB (roughly 180 MB) is much smaller than 1.7 GB, which left me a bit perplexed.
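
For reference, git count-objects reports both size (disk space used by loose objects) and size-pack (disk space used by packfiles) in KiB. Here is a minimal sketch that converts the two figures from the output above into MiB; the helper name kib_to_mib is made up for illustration:

```shell
#!/bin/sh
# Convert a KiB figure (as printed by `git count-objects -v`) to MiB.
kib_to_mib() {
    awk -v kib="$1" 'BEGIN { printf "%.1f", kib / 1024 }'
}

# Figures copied from the count-objects output above.
loose_kib=1339824    # "size": loose (unpacked) objects
packed_kib=183607    # "size-pack": objects stored in packfiles

echo "loose:  $(kib_to_mib "$loose_kib") MiB"
echo "packed: $(kib_to_mib "$packed_kib") MiB"
```

Git can also print human-readable units directly with git count-objects -vH. Note that the loose-object figure here is considerably larger than the packed one, which may be worth keeping in mind when reading the rest of the numbers.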

Next, I downloaded the BFG Repo Cleaner (https://rtyley.github.io/bfg-repo-cleaner) and ran java -jar bfg-1.12.3.jar --strip-blobs-bigger-than 100M in the top-level directory to remove files bigger than 100 MB from every commit except the latest. However, BFG returned the following message:

Warning : no large blobs matching criteria found in packfiles 
- does the repo need to be packed?

Repeating this with 50M gave the same result.

Does this mean that all the files larger than 50 MB are in the latest commit? In the source code browser on Bitbucket, I looked at the folders that contain large data files, but those files are not there (they were successfully ignored).
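
One way to check this locally, rather than through the Bitbucket browser, is to list every blob reachable from any ref together with its size. This is a sketch that assumes it is run inside the repository. blobs_over is a hypothetical helper, and the 50 MiB threshold simply mirrors the 50M value tried above:

```shell
#!/bin/sh
# Keep only blobs of at least $1 bytes, largest first.
# Input lines look like: "<type> <size> <sha> <path>"
blobs_over() {
    awk -v min="$1" '$1 == "blob" && $2 >= min { sub(/^blob /, ""); print }' |
        sort -rn
}

# Only run the git pipeline when inside a repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        blobs_over $((50 * 1024 * 1024))
fi
```

If this prints nothing, no blob of that size is reachable from any local branch or tag, in the latest commit or any earlier one.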

Could anyone briefly explain the discrepancy between the reported repository size and the apparent absence of large files in the repo?

Kouichi C. Nakamura asked Feb 28 '15 09:02



2 Answers

At this point you would need to look at the repository on the server to know with certainty what the problem is, and you will likely need to talk to Bitbucket technical support. But your description makes it sound like your repository has some garbage in it that can be cleaned up.

Suppose you had pushed a 500 MB file to your Bitbucket repository. You then realize your error, remove the file from your repository in some way (with BFG, for example) and push the updated ref. The ref on your remote is updated to point to the new commit, and your repository no longer appears to contain the big file (if you cloned your repository, you would not get the big file).

But the remote would not have deleted the old commit or the old file in that commit. It would merely have disconnected the commit from the graph, so the large file would no longer be "reachable". It would, in fact, be "garbage" eligible for "garbage collection". Running garbage collection would delete the big file, and your repository size on the server would shrink.
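
The effect can be reproduced locally in a throwaway repository. This is only a sketch: the function name, the file names and the 1 MiB stand-in file are arbitrary, and it assumes git is installed:

```shell
#!/bin/sh
# Create a scratch repo, commit a "big" file, drop that commit from the
# branch, then list the objects that no ref points to any more.
demo_unreachable() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q .
        git config user.name "demo"
        git config user.email "demo@example.invalid"
        git config commit.gpgsign false
        echo "keep me" > keep.txt
        git add keep.txt
        git commit -q -m "first commit"
        # Stand-in for the accidentally committed large file.
        head -c 1048576 /dev/zero > big.bin
        git add big.bin
        git commit -q -m "add big file"
        # Drop the commit from the branch; its objects stay in .git/objects.
        git reset -q --hard HEAD~1
        # List objects not reachable from any ref, ignoring the reflog.
        git fsck --unreachable --no-reflogs
    )
    status=$?
    rm -rf "$repo"
    return "$status"
}

command -v git >/dev/null 2>&1 && demo_unreachable
```

The output should contain lines starting with "unreachable commit", "unreachable tree" and "unreachable blob". These are the objects that still take up space until a garbage collection prunes them.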

There is no way to ask the server to run a GC over the Git protocol. Bitbucket's support should be able to perform this for you:

You'll need to look for us to trigger the gc instead. I guess the best way is to "escalate" it if it is really urgent, and we should be able to get to it immediately. — Bitbucket Support (Dec. 2016)

Note that this assumes you actually have the full repository locally. Run git fetch --all to make sure you don't have only a subset of the (reachable) history. If you use BFG, make sure you have cloned your repository with the --mirror option.
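
The mirror-clone workflow could look roughly like the following. It follows the usage described on the BFG site. The repository URL, the repo.git directory name and the run / bfg_cleanup helpers are placeholders for illustration. With DRY_RUN=1 the commands are only printed, not executed:

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# $1: repository URL (placeholder), $2: path to the BFG jar,
# $3: size threshold passed to --strip-blobs-bigger-than (default 50M)
bfg_cleanup() {
    url=$1
    jar=$2
    limit=${3:-50M}
    # A mirror clone fetches all refs, not just the checked-out branch.
    run git clone --mirror "$url" repo.git || return 1
    run java -jar "$jar" --strip-blobs-bigger-than "$limit" repo.git || return 1
    # Drop reflog entries and prune the now-unreachable objects locally.
    run git -C repo.git reflog expire --expire=now --all || return 1
    run git -C repo.git gc --prune=now --aggressive || return 1
    # Push the rewritten refs back to the remote.
    run git -C repo.git push
}

DRY_RUN=1
bfg_cleanup "git@bitbucket.org:USER/REPO.git" bfg-1.12.3.jar 50M
```

Even after such a push, the old objects may remain on the server until it runs its own garbage collection, as described above.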

Edward Thomson answered Oct 11 '22 13:10



We think we had the same problem today and were able to solve it without contacting Bitbucket support, as described below. Note that this method discards the last commit from the repo, so you will probably want a backup of it.

Bitbucket reported that our repo was about 2.1GB, whereas a fresh clone took only about 250MB locally. From this, we concluded that the extra size most likely came from big files in unreachable commits (thanks to Edward's answer above).

This is how to list unreachable commits locally, ignoring reachability via the reflog:

git fsck --unreachable --no-reflogs
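
To get a rough idea of how much data those unreachable objects hold, their sizes can be summed. This is a sketch that assumes it is run inside the repository. sum_sizes is a hypothetical helper, and the sizes reported by git cat-file --batch-check are uncompressed object sizes, not the space used on disk:

```shell
#!/bin/sh
# Sum the sizes (third column, in bytes) of lines in the default
# `git cat-file --batch-check` format: "<sha> <type> <size>".
sum_sizes() {
    awk '{ total += $3 } END { printf "%.0f\n", total }'
}

# Only run the git pipeline when inside a repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
    git fsck --unreachable --no-reflogs 2>/dev/null |
        awk '$1 == "unreachable" { print $3 }' |
        git cat-file --batch-check |
        sum_sizes
fi
```

A total far below the gap between the size Bitbucket reports and the size of a fresh clone would suggest that the garbage exists only on the server.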

Locally, unreachable commits can be cleaned with:

git reflog expire --expire-unreachable="now" --all
git prune --expire="now" -v
git gc --aggressive --prune="now"

However, we cannot run any of these commands remotely on Bitbucket. Their page about reducing repo size does say (section Remove the repository limitation) that they run git gc themselves in response to git reset --hard HEAD~1 (which discards the last commit) followed by git push -f. The section Garbage collecting dead data also says one can try this sequence: git reflog expire --expire=now --all, then git gc --prune=now, then git push --all --force. Given all this, I decided to try the following locally, hoping it would cut out the reflog and prune locally, then push to the remote Bitbucket repository, where it would trigger a gc:

git reflog expire --expire-unreachable="30m" --all
git prune --expire="30m" -v
git gc --prune="30m"
git reset --hard HEAD~1
git push -f

This worked: the repo size immediately dropped from 2.1GB to about 250MB. :)

Note that the time parameter to expire / expire-unreachable / prune sets the expiration cut-off point, measured back from now. So "now" means expire / prune everything, and "30m" means everything except changes from the last 30 minutes.


Edit:

One thing that comes to mind on reflection: since Git by default expires unreachable reflog entries after 30 days, it's possible that my command sequence worked not because I ran git reflog expire, git prune and git gc locally (whose effects perhaps never reached the remote repo), but because the remote git gc triggered by the git reset and force push removed all the unreachable commits older than 30 days.

So, it may be that the following would have had the same effect for me:

git reset --hard HEAD~1
git push -f

For unreachable changes made in the last 30 days, I'd still need to contact Bitbucket support.

Jan Żankowski answered Oct 11 '22 13:10
