
Why do large files still exist in my packfile, after scrubbing them with filter-branch?

I've rewritten the history of my repository to remove some large FLV files using git filter-branch. I primarily followed the GitHub article on removing sensitive data and similar instructions found elsewhere on the Internet:

Removing the large FLVs:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch public/video/*.flv' --prune-empty -- --all

Removing the original refs:

rm -rf .git/refs/original/

Clearing the reflog:

git reflog expire --expire=now --all

Pruning unreachable objects:

git gc --prune=now

Aggressively pruning unreachable objects:

git gc --aggressive --prune=now

Repacking things:

git repack -A -d

And my gitdir is still 205 MB, contained almost entirely in a single packfile:

$ du -h .git/objects/pack/*
284K    .git/objects/pack/pack-f72ed7cee1206aae9a7a3eaf75741a9137e5a2fe.idx
204M    .git/objects/pack/pack-f72ed7cee1206aae9a7a3eaf75741a9137e5a2fe.pack

Using this script, I can see that the FLVs I've removed are still contained in the pack:

All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file.
size   pack   SHA                                       location
17503  17416  1be4132fa8d91e6ce5c45caaa2757b7ea87d87b0  public/video/XXX_FINAL.flv
17348  17261  b7aa83e187112a9cfaccae9206fc356798213c06  public/video/YYY_FINAL.flv
....
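The linked script is not reproduced here. As a rough sketch of how such a listing can be produced, the function below (the name largest_blobs is made up for illustration) reads git verify-pack -v output and prints the largest blobs in the pack. Unlike the table above, it reports sizes in bytes and does not resolve paths.

```shell
# Hypothetical helper: print "size size-in-pack SHA" for the largest blobs
# stored in the packfiles of a repository, largest first.
# $1 = repository directory (default: current directory)
# $2 = number of entries to show (default: 10)
largest_blobs() (
  cd "${1:-.}" || exit 1
  gitdir=$(git rev-parse --git-dir) || exit 1
  # verify-pack -v prints: SHA type size size-in-pack offset [depth base-SHA]
  git verify-pack -v "$gitdir"/objects/pack/pack-*.idx \
    | awk '$2 == "blob" { print $3, $4, $1 }' \
    | sort -rn \
    | head -n "${2:-10}"
)
```

Calling largest_blobs . 5 inside the repository lists the five largest blobs. To find the path of one of them, search for its SHA in the output of git rev-list --objects --all. An object that is no longer reachable from any ref will not show up there.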

Cloning the repository via git clone --bare my-repo yields my-repo.git which is also 205MB in size.

What can I do to remove these (presumably) unreferenced objects from the pack and shrink my repository back to the size it would be if they'd never been committed? If they are still referenced somehow, is there a way to tell where?

Update

Upon attempting to re-run git filter-branch, I received this notice:

Cannot create a new backup.
A previous backup already exists in refs/original/
Force overwriting the backup with -f

I verified that there were no refs in .git/refs/original; indeed, the directory didn't exist at all. Is there some other way that git stores refs, that I don't know about?

meagar asked May 18 '12 16:05



1 Answer

Upon cloning a fresh copy of the repository, I was able to run the commands exactly as above, and achieve the desired result: My .git directory was reduced from 205 MB down to 20 MB, and the large FLV files were removed cleanly from the packfile.

The first attempt was also performed on a fresh clone to which I had made no modifications, so I do not have a satisfying explanation for why the FLV files continued to linger inside the packfile.
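For anyone who hits the same situation, one way to check whether a blob is still referenced is to walk every ref and test whether the blob is reachable from it. The sketch below assumes you already know the blob's SHA; the function name refs_reaching is made up for illustration. It only inspects refs, so it will not report objects kept alive by reflog entries or the index.

```shell
# Hypothetical helper: print every ref from which the given object is reachable.
# $1 = full object SHA, e.g. one taken from a pack listing
refs_reaching() {
  blob=$1
  git for-each-ref --format='%(refname)' | while read -r ref; do
    # rev-list --objects lists commits, trees and blobs reachable from the ref
    if git rev-list --objects "$ref" | grep -q "^$blob"; then
      echo "$ref"
    fi
  done
}
```

Because git for-each-ref lists refs regardless of how they are stored, any leftover refs/original/ entries would show up here even when no matching file exists under .git/refs.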

I originally submitted the below answer, thinking that I'd caused a problem by running git repack -a before removing .git/refs/original, causing the original refs to become packed so that when I did remove .git/refs/original there was no effect; my original refs would still be referencing the large FLV files. This doesn't seem to hold up, however. Running the above commands on a freshly cloned copy of the repository with the addition of git repack -a immediately after git filter-branch doesn't seem to affect the outcome - the FLV files are still purged from the packfile. I have no reason to believe this is relevant to the original problem.


Is there some other way that git stores refs, that I don't know about?

There is. It turns out I wasn't entirely truthful about the order of commands as listed above. I had run git repack -a before running rm -rf .git/refs/original, and Git had packed the refs away (to be determined where; experimenting now). When I then ran rm -rf .git/refs/original, nothing was removed. git gc was unable to shrink my packfile because I still had lingering references to the old files through the packed refs/original refs.
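On the general question of how refs are stored: besides loose files under .git/refs, Git can keep refs in the single file .git/packed-refs, which git pack-refs writes. The sketch below shows this in a throwaway repository, assuming the default files ref backend. It creates a stand-in backup ref by hand (the name refs/original/refs/heads/demo is made up) instead of running filter-branch, then deletes it through Git so that both storage forms are handled.

```shell
# Throwaway repository, so nothing here touches real data.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo" || exit 1
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'initial'

# Stand-in for the backup ref that filter-branch leaves behind.
git update-ref refs/original/refs/heads/demo HEAD

# Packing moves the ref out of its loose file and into .git/packed-refs.
git pack-refs --all
[ -e .git/refs/original/refs/heads/demo ] || echo "loose ref file is gone"

# The ref still exists as far as Git is concerned.
listed=$(git for-each-ref --format='%(refname)' refs/original/)
echo "still listed: $listed"

# Delete through Git rather than rm, so packed refs are removed too.
git for-each-ref --format='delete %(refname)' refs/original/ \
  | git update-ref --stdin
remaining=$(git for-each-ref refs/original/)
```

After the last command, remaining is empty, and a later git reflog expire plus git gc --prune=now has no backup refs left to keep old objects alive.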

meagar answered Oct 08 '22 03:10