We have a number of git repositories which have grown to an unmanageable size due to the historical inclusion of binary test files and Java .jar files.
We are just about to go through the exercise of running git filter-branch on these repositories and re-cloning them everywhere they are used (from dozens to hundreds of deployments each, depending on the repo). Given the problems with rewriting history, I was wondering if there might be any other solutions.
Ideally I would like to externalise problem files without rewriting the history of each repository. In theory this should be possible because you are checking out the same files, with the same sizes and the same hashes, just sourcing them from a different place (a remote rather than the local object store). Alas none of the potential solutions I have found so far appear to allow me to do this.
Starting with git-annex, the closest I could find to a solution to my problem was How to retroactively annex a file already in a git repo, but as with just removing the large files, this requires the history to be rewritten to convert the original git add into a git annex add.
Moving on from there, I started looking at other projects listed on what git-annex is not, so I examined git-bigfiles, git-media and git-fat. Unfortunately we can't use the git-bigfiles fork of git since we are an Eclipse shop and use a mixture of git and EGit. It doesn't look like git-media or git-fat can do what I want either: while you could replace existing large files with the external equivalents, you would still need to rewrite the history in order to remove large files which had already been committed.
So, is it possible to slim a .git repository without rewriting history, or should we go back to the plan of using git filter-branch and a whole load of redeployments?
As an aside, I believe that this should be possible, but it is probably tied to the same limitations as those of git's current shallow clone implementation.
Git already supports multiple possible locations for the same blob, since any given blob could be in the loose object store (.git/objects) or in a pack file (.git/objects/pack), so theoretically you would just need something like git-annex to be hooked in at that level rather than higher up (i.e. have the concept of a download-on-demand remote blob, if you like). Unfortunately I can't find anyone having implemented or even suggested anything like this.
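To illustrate the point about a blob living in more than one place, here is a minimal sketch using a throwaway repository. The file name, commit message and demo identity are made up for the example; git count-objects and git repack are standard commands.

```shell
set -eu
# Demo identity so the commit works on a machine without git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q
echo 'hello' > file.txt
git add file.txt
git commit -q -m 'one commit: a blob, a tree and a commit object'

# Right after committing, the objects are loose files under .git/objects.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_before=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

# Repacking moves them into a pack file under .git/objects/pack
# and deletes the now-redundant loose copies.
git repack -a -d -q

loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

# The blob is read the same way regardless of where it is stored.
blob=$(git rev-parse HEAD:file.txt)
content=$(git cat-file -p "$blob")

echo "loose: $loose_before -> $loose_after, packed: $packed_before -> $packed_after"
```

The object ID and content of the blob do not change when it moves from loose storage into a pack, which is the property the question hopes to extend to a remote store.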
That does indeed explain why it should never be done... Because of the way Git identifies commits by their content and all previous commits, any change (however minor) to a commit will look like a totally new branch of development to Git. There is no way to make a rewritten history look "almost the same".
Sort of. You can use Git's replace feature to set aside the big bloated history so that it is only downloaded if needed. It's like a shallow clone, but without a shallow clone's limitations.
The idea is you reboot a branch by creating a new root commit, then cherry-pick the old branch's tip commit. Normally you would lose all of the history this way (which also means you don't have to clone those big .jar files), but if the history is needed you can fetch the historical commits and use git replace to seamlessly stitch them back in.
See Scott Chacon's excellent blog post for a detailed explanation and walk-through.
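The reboot-and-stitch idea can be sketched in a throwaway repository roughly as follows. This is only an illustration under assumptions: the file names, commit messages and demo identity are invented, and both branches live in one repository here, whereas in practice historical/master would sit in a separate remote and be fetched only when needed.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/historical/master

# The bloated old history: three commits, one of which adds a jar.
printf 'pretend this is large\n' > big.jar
git add big.jar
git commit -q -m 'add a jar'
git rm -q big.jar
echo 'v1' > foo.txt
git add foo.txt
git commit -q -m 'remove all of the big .jar files'
echo 'v2' > foo.txt
git commit -q -am 'modify foo'

# Reboot: a new parentless commit carrying the tree just before the tip...
root=$(echo 'instructions: fetch historical/master for older history' |
       git commit-tree 'historical/master~1^{tree}')
git checkout -q -b master "$root"
# ...then cherry-pick the old tip on top of it.
git cherry-pick historical/master >/dev/null
echo 'v1' > bar.txt
git add bar.txt
git commit -q -m 'modify bar'

# Without the replacement, master only has the short history.
slim_count=$(git rev-list --count master)

# Stitch: make the cherry-picked "modify foo" stand in for the historical one.
git replace master~1 historical/master
full_count=$(git rev-list --count master)

echo "commits on master: $slim_count without history, $full_count with it"
```

The replacement is a local ref under refs/replace/, so a clone that never fetches the historical commits never has to download them.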
Advantages of this approach:
[...] .jars and everything, you still can.

Disadvantages of this approach:
This approach still has some of the same problems as rewriting history. For example, if your new repository looks like this:
* modify bar (master)
|
* modify foo  <--replace-->  * modify foo (historical/master)
|                            |
* instructions               * remove all of the big .jar files
                             |
                             * add another jar
                             |
                             * modify a jar
                             |
and someone has an old branch off of the historical branch that they merge in:
* merge feature xyz into master (master)
|\__________________________
|                           \
* modify bar                 * add feature xyz
|                            |
* modify foo  <--replace-->  * modify foo (historical/master)
|                            |
* instructions               * remove all of the big .jar files
                             |
                             * add another jar
                             |
                             * modify a jar
                             |
then the big historical commits will reappear in your main repository and you're back to where you started. Note that this is no worse than rewriting history—someone might accidentally merge in the pre-rewrite commits.
This can be mitigated by adding an update hook in your shared repository to reject any pushes that would reintroduce the historical root commit(s).
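Such a hook could look something like the sketch below, which builds a throwaway shared repository, installs the hook, and tries two pushes. The config key hooks.forbiddenroot is a name invented for this example (not a built-in Git setting), as are the repository paths, file names and demo identity.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q --bare "$work/shared.git"
mkdir -p "$work/shared.git/hooks"

# The update hook runs once per pushed ref with: <refname> <old-sha> <new-sha>
cat > "$work/shared.git/hooks/update" <<'HOOK'
#!/bin/sh
newrev=$3
# hooks.forbiddenroot is a made-up key holding the historical root commit.
forbidden=$(git config hooks.forbiddenroot) || exit 0
# A ref deletion is pushed as an all-zero new-sha; nothing to inspect.
case $newrev in *[!0]*) ;; *) exit 0 ;; esac
if git merge-base --is-ancestor "$forbidden" "$newrev" 2>/dev/null; then
    echo "rejected: $1 would reintroduce historical commit $forbidden" >&2
    exit 1
fi
exit 0
HOOK
chmod +x "$work/shared.git/hooks/update"

# A developer repository with both the historical and the slim history.
git init -q "$work/dev"
cd "$work/dev"
git symbolic-ref HEAD refs/heads/historical/master
echo 'pretend this is large' > big.jar
git add big.jar
git commit -q -m 'modify a jar'
forbidden=$(git rev-parse HEAD)

git checkout -q --orphan master
git rm -q -rf .
echo 'see historical/master for older history' > README
git add README
git commit -q -m 'instructions'

git --git-dir="$work/shared.git" config hooks.forbiddenroot "$forbidden"
git remote add origin "$work/shared.git"

if git push -q origin master 2>/dev/null; then slim_push=accepted; else slim_push=rejected; fi
if git push -q origin historical/master 2>/dev/null; then hist_push=accepted; else hist_push=rejected; fi
echo "slim history: $slim_push, historical history: $hist_push"
```

The check relies on git merge-base --is-ancestor, which succeeds when the forbidden commit is reachable from the pushed tip. If there are several historical roots, the same test would need to be repeated for each of them.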
No, that is not possible – You will have to rewrite history. But here are some pointers for that:
[...] git filter-branch.
You do not need to clone again! Just run these commands instead of git pull and you will be fine (replace origin and master with your remote and branch):
git fetch origin
git reset --hard origin/master
But note that unlike git pull, you will lose all the local changes that are not pushed to the server yet.
[...] git pull, git merge and git rebase (also as git rebase --onto) do. Then give everybody involved a quick training on how to handle this rewrite situation (5-10 mins should be enough, the basic dos and don’ts).
git filter-branch does not cause any harm in itself, but causes a lot of standard workflows to cause harm. If people don’t act accordingly and merge old history, you might just have to rewrite history again if you don’t notice soon enough.
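The fetch-and-reset update mentioned in the pointers above can be tried out end to end with the sketch below. It simulates a server, a rewrite and an existing deployment in a temporary directory. The backup branch name pre-rewrite-backup is an addition of this example, not part of the original advice: it keeps the old commits reachable in case something local had not been pushed. All paths, file names and the demo identity are likewise made up.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q --bare "$work/server.git"
git --git-dir="$work/server.git" symbolic-ref HEAD refs/heads/master

# Original history, including a jar, pushed to the server.
git init -q "$work/author"
cd "$work/author"
git symbolic-ref HEAD refs/heads/master
echo 'pretend this is large' > big.jar
echo 'code' > app.txt
git add big.jar app.txt
git commit -q -m 'add code and a jar'
git remote add origin "$work/server.git"
git push -q origin master

# An existing deployment cloned before the rewrite.
git clone -q "$work/server.git" "$work/deploy"

# Stand-in for the real rewrite: drop the jar and force-push.
git rm -q --cached big.jar
git commit -q --amend -m 'add code'
git push -q --force origin master

# In the deployment: update without cloning again.
cd "$work/deploy"
old_head=$(git rev-parse HEAD)
git fetch -q origin
git branch pre-rewrite-backup          # optional safety net (name is arbitrary)
git reset -q --hard origin/master

echo "deployment moved from $old_head to $(git rev-parse HEAD)"
```

Note that keeping the backup branch also keeps the old large objects in that clone, so it should be deleted once nothing on it is needed any more.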