
After deleting a binary file from Git history why is my repository still large?

Let me preface this question by saying that I am aware of the previous questions on this subject on Stack Overflow. In fact, I've tried all the solutions I could find, but there is a binary file in my repo that refuses to be removed and continues to greatly inflate my repo size.

Methods I've tried:

  • David Underhill's script
  • GitHub's how-to

Both were recommended by Darhuuk's answer to "Remove files from git repo completely".

However, after trying both of those solutions, the script to find large files in git still finds the offending binary, while the script from this answer no longer finds the commit for the binary. Both of these scripts were suggested by this answer.

The repo is still 44 MB after the attempts at removal, which is far too large for the relatively small size of the source, and which suggests the large-file script is doing its job properly. I've tried pushing up to GitHub (I made a fork just in case) and then doing a fresh clone to see if the repo size had decreased, but it is still the same size.

Can someone explain what I am doing wrong or suggest an alternative method?

I should note that I am not just interested in trimming the file from my local repo; I also want to be able to fix the remote repo on GitHub.

James McMahon asked Jun 29 '12 03:06




4 Answers

2017 Edit: You should probably look into BFG Repo-Cleaner if you are reading this.


Embarrassingly, the reason my local repos were not shrinking in size is that I was using the wrong path to the file in filter-branch. So while I thank J-16 SDiZ and CodeGnome for their answers, my problem was between the chair and the keyboard.

In an effort to make this question less of a monument to my stupidity and actually useful to people I've taken the time to write up the steps one would have to go through after trimming the repo in order to get the repo back up on Github. Hope this helps someone out down the line.


Removing offending files

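To remove the offending files, run the shell script below, which is based on the GitHub "remove sensitive data" how-to. Pass the path of the file to remove, relative to the repository root, as the first argument.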

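#!/usr/bin/env bash
# $1 is the path to purge, relative to the repo root.
# Double quotes let the shell expand $1; the inner single quotes keep a path with spaces as one argument.
git filter-branch --index-filter "git rm -r -q --cached --ignore-unmatch '$1'" --prune-empty --tag-name-filter cat -- --all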

rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now

I ran this on every branch of my local repository, but that is not necessary: the trailing -- --all makes filter-branch rewrite every ref in one pass. You do, however, need every branch available locally for the next step, so keep that in mind. Once you are done you should see the size of your local repo decrease. You should also be able to run the blob script in CodeGnome's answer and see that the offending blob is gone. If not, double-check that the file name and path are correct.

What git filter-branch is actually doing here is running the command listed in quotes on each commit in the repo.

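The rest of the script removes the backup refs that filter-branch leaves in .git/refs/original/, expires the reflog, and garbage-collects the objects that are no longer referenced.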
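
Since the failure described above came down to a wrong path, it may be worth confirming that the path really appears in history before running the rewrite. The following is only a minimal sketch, assuming it is run from the repository root; path_in_history is a hypothetical helper name, not a Git command.

```shell
#!/usr/bin/env bash
# path_in_history is a hypothetical helper, not part of Git.
# It succeeds if at least one commit reachable from any ref touches the given path.
path_in_history() {
    local path=$1
    # --all walks every ref; the pathspec after -- limits output to commits touching that path.
    git log --all --format=%H -- "$path" | grep -q .
}
```

For example, path_in_history path/to/file.bin && echo "found in history" prints the message only when the path matches something filter-branch could remove.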

Pushing the trimmed repo

Now that the local repo is in the state you need it to be, the trick is to get it back up on GitHub. Unfortunately, as far as I can tell, there is no way to completely remove the binary data from the existing GitHub repo. Here is the quote from the GitHub sensitive data how-to:

Be warned that force-pushing does not erase commits on the remote repo, it simply introduces new ones and moves the branch pointer to point to them. If you are worried about users accessing the bad commits directly via SHA1, you will have to delete the repo and recreate it.

It sucks that you need to recreate the GitHub repo, but the good news is that recreating the repo is actually pretty easy. The pain is that you also have to recreate the data in issues and the wiki, which I'll go into below.

What I recommend is creating a new repo on GitHub and then switching it out with your old repo when you are ready. This can be done by renaming the old repo to something like "repo name old" and then changing the name of the newly created repo to "repo name". When you create the new repo, make sure to uncheck "Initialize with README"; otherwise you're not going to be dealing with a clean slate.

If you completed the last step you should have your repo cleaned and ready to go. The remotes now need to be changed to match the new GitHub repo location. I do this by editing the .git/config file directly, though I am sure someone is going to tell me that is not the right way to do it.
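
As an alternative to editing .git/config by hand, the remote can be repointed with git remote set-url. This is a small sketch rather than a required step; point_remote_at is a hypothetical wrapper name, and the remote name and URL you pass are your own.

```shell
#!/usr/bin/env bash
# point_remote_at is a hypothetical helper; it wraps the real `git remote set-url`.
point_remote_at() {
    local remote=$1 new_url=$2
    git remote set-url "$remote" "$new_url"
    # Print the fetch/push URLs so the change can be checked by eye.
    git remote -v
}
```

It would be called as point_remote_at origin <url-of-new-repo>, where the URL is whatever GitHub shows for the newly created repo.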

Before doing the push, make sure you have all the branches and tags you want to push in your local repo. Once you are ready, push all branches and tags using the following:

git push --all
git push --tags

Now you should have a remote repo to match your trimmed local repo. Double-check that all the data made it over, just in case.

Now if you don't have to worry about issues or the wiki, you are done. If you do, read on.
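
One way to do that check is to compare the local branch and tag IDs with what the remote reports. The sketch below assumes the remote is reachable and that every local branch and tag was meant to be pushed; refs_match_remote is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# refs_match_remote is a hypothetical helper, not a Git command.
# It succeeds when the local branches and tags match the remote's, ref for ref.
refs_match_remote() {
    local remote=${1:-origin}
    local local_refs remote_refs
    # Local branches and tags as "<sha><TAB><refname>", the same layout ls-remote prints.
    local_refs=$(git for-each-ref --format='%(objectname)%09%(refname)' refs/heads refs/tags | sort)
    # --refs drops peeled tag entries (^{}) so annotated tags compare cleanly.
    remote_refs=$(git ls-remote --refs --heads --tags "$remote" | sort)
    if [ "$local_refs" = "$remote_refs" ]; then
        return 0
    fi
    # On a mismatch, show both lists so the missing ref can be spotted.
    printf 'local:\n%s\nremote:\n%s\n' "$local_refs" "$remote_refs" >&2
    return 1
}
```

Running refs_match_remote origin after the two pushes should exit successfully; a non-zero exit means a branch or tag is missing or differs on one side.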

Moving over wikis

The GitHub wiki is just another repo associated with your main repo. So to get started, clone your old wiki repo somewhere. The next part is kind of tricky: as far as I can tell, you need to click on the wiki tab of your new repo in order to create the wiki, but doing so seeds the newly created wiki with an initial file. So what I did, and I am not sure if there is a better way, is change the remote to the newly created wiki repo and push to the new location using

git push --all --force

The force is needed here because otherwise git will complain about the tip of the current branch not matching. I think this may leave the initial page in a detached state in the git repo, but the effect of that on the size of the repo should be negligible.

Moving over issues

There is advice on this in this answer. But the script linked in that answer looks fairly incomplete: there is a TODO for comment importing, and I couldn't tell whether it would bring over the state of issues or not.

So given that I had a fairly small open issues queue and that I didn't mind losing closed issues, I elected to bring things over by hand. Note that it is impossible to do this with proper attribution to other people on comments. For a larger, more established project you would need to write a more robust script to bring everything over, but that wasn't needed in my particular case.

James McMahon answered Oct 08 '22 23:10


Assuming that you've already removed the blob from your history with git-filter-branch(1) and friends, Git often keeps things around in the reflogs, packfiles, and loose repository objects. The incantation to remove these unreferenced objects is:

git prune --expire=now
git reflog expire --expire-unreachable=now --rewrite --all
git repack -a -d
git prune-packed

If you've done this and you still have a bigger repository than you think you should, then you still have references to your blob somewhere in the repository. You'll have to go back to step one and remove them. This may help:

# List all blobs by size in bytes.
git rev-list --all --objects   |
    awk '{print $1}'           |
    git cat-file --batch-check |
    fgrep blob                 |
    sort -k3nr
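
The listing above prints object IDs without file names. As a hedged variation, the sketch below keeps the path next to each blob by passing it through git cat-file as %(rest); largest_blobs is a hypothetical helper name, and the default of 10 results is an arbitrary choice.

```shell
#!/usr/bin/env bash
# largest_blobs is a hypothetical helper, not a Git command.
# It prints "blob <size-in-bytes> <sha> <path>" for the N largest blobs reachable from any ref.
largest_blobs() {
    local count=${1:-10}
    # rev-list emits "<sha> <path>"; cat-file echoes the path back as %(rest).
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob"' |
        sort -k2,2nr |
        head -n "$count"
}
```

If the file you removed still shows up in largest_blobs 20, some ref still reaches it and the history rewrite needs another look.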

Todd A. Jacobs answered Oct 08 '22 22:10


The script from "script to find large files in git" checks the .pack file, that is, the raw object store. The second script shows that the large object is no longer referenced. If you really want to clean that up, you can do a gc and repack:

git gc --aggressive --prune=now
git repack -A -d

If this still doesn't help, a remote branch may still reference the object. You can try the following:

  1. Find out which commit has this object (see "Which commit has this blob?") and run git branch -a --contains <commit-ish>
  2. Remove the remote branch using git branch -r -D branchname
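
The two steps above can be sketched as one lookup from blob to refs. This assumes Git 2.16 or newer for git log --find-object; refs_containing_blob is a hypothetical helper name, and it only reports refs, leaving the deletion to you.

```shell
#!/usr/bin/env bash
# refs_containing_blob is a hypothetical helper; it needs Git 2.16+ for --find-object.
refs_containing_blob() {
    local blob=$1 commit
    # --find-object lists commits whose diff adds or removes the given object.
    git log --all --format=%H --find-object="$blob" |
        while read -r commit; do
            # Print every branch, tag, or remote-tracking ref that can reach the commit.
            git for-each-ref --format='%(refname)' --contains "$commit"
        done | sort -u
}
```

Any refs/remotes/... line in the output is a remote branch that still keeps the blob alive locally.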

Update -- What is a "remote branch"?

  • A remote branch (remote-tracking branch) is what git fetch writes to when you do a git fetch or git pull. (git pull is the same as git fetch refspec followed by git merge remote-branch.)

  • If you cloned from a remote repository, deleting the remote branch should have no ill effect: you can always fetch from the remote again using something like git fetch origin refs/heads/master:refs/remotes/origin/master (this fetches the master branch from the remote into the remote branch remotes/origin/master).

  • If this branch was created by you, deleting it should be okay too, because you should have a "normal" (tracking) branch for it. But you should double-check this.

J-16 SDiZ answered Oct 08 '22 23:10


Can someone explain what I am doing wrong or suggest an alternative method?

Have you tried applying DMAIC? Define, Measure, Analyze, Improve, Control.

D - My repo is still large after deleting a file from git history.
M - Determine size of fresh repo using git init to establish baseline.
A - Identify, validate and select root cause. Experiment with git-repo-analysis.
I - Identify, test and implement solution. Maybe BFG Repo-Cleaner will help. Maybe it won't.
C - Sustain the gains. Look at something like Git LFS or other appropriate control method.

I also want to be able to fix the remote repo on Github.

This will depend on how you choose to resolve the problem. For example, using BFG to trim files from history rewrites history and changes commit SHAs, so there's going to be some give and take here depending on your specific needs and desired outcomes.

vhs answered Oct 09 '22 00:10