 

can I save space in a git repo by squashing commits?

Tags:

git

If I have a git repository with an initial commit followed by 100 small commits that change 100 files, with each one just making one change to one file, can I save space by squashing those 100 commits into one big 100-files-changed commit? For instance:

$ git checkout master
Already on 'master'
$ git reset --soft HEAD~100 && git commit -m 'squash last 100 commits'

will replace the tip of branch master with a new commit that has the same contents as the old commit, but leaves the 100 previous commits out of its history. How much space might this save?

asked Sep 27 '14 at 23:09 by torek


1 Answer

Maybe (even "probably") it will save some space, but not right away. In fact, at first it will make things just a bit bigger.

Let's take a look at how git actually stores things. It gets complicated, but it starts very simple: git stores every file completely intact (using "zlib deflate" compression but otherwise just the original file).

The git object model

In a git repository, everything is stored as an object. Each object is named by its SHA-1, which is a cryptographic checksum of its actual contents (its object-type, size, and data). This lets you do one of two things: compute the SHA-1 and store the object by its name (or discover that it's already in there); or, given the SHA-1 name, find the object and thus access its contents.
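As a quick illustration, git's plumbing commands expose both directions (the ID below is what git actually computes for the six bytes "hello" plus a newline):

$ echo 'hello' | git hash-object -w --stdin   # compute the SHA-1 and store the blob (-w = write)
ce013625030ba8dba906f756967f9e9ca394464a
$ git cat-file -p ce01362                     # given the name, retrieve the contents
hello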

There are four types of objects. One is uninteresting here.1 The other three are:

  • "commit objects", which hold commit data, including the commit message itself plus the SHA-1 ID of a a "tree" object;
  • "tree" objects, which store lists of stuff: SHA-1, file-name, and the file's mode;2 and
  • "blob" (file) objects, which store your actual files. (Incidentally, the word "blob" is probably derived from the database term BLOB, which is a "backronym" for "Binary Large OBject".)

By starting from the SHA-1 ID of a commit, git can extract the tree, which tells it which blobs to extract and what file names to give them (blob object 1234567... is to be called file1.txt, for instance).
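You can watch git do this with git cat-file -p (the IDs below are illustrative, in the same 7-characters-plus-dots style used throughout this answer):

$ git cat-file -p HEAD                 # the commit: its tree ID, parent, author, message
tree 89abcde...
[parent, author, committer, and message follow]
$ git cat-file -p HEAD^{tree}          # the tree: mode, type, SHA-1, and name per entry
100644 blob 1234567...    file1.txt
[one line per file]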

The actual objects are stored in subdirectories in .git/objects, e.g., object 1234567... is kept in .git/objects/12/34567.... (SHA-1s are always 40 characters long, but we mostly abbreviate them as 7 plus three dots, which is usually sufficient.)
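Continuing the hello-blob example above, the stored object really does show up as a file laid out this way:

$ ls .git/objects/ce/
013625030ba8dba906f756967f9e9ca394464a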


1Just for completeness, the last object type is the "annotated tag": it contains an author ("tagger") like a commit, another SHA-1 like a commit, and a message like a commit, so it's very much like a commit; but the SHA-1 ID it contains is normally the ID of a commit object, not a tree object, and a lightweight tag (an ordinary reference) normally points to the annotated tag object. Among other things, this lets you put a cryptographically-signed tag into the repository, which others can check to verify that you approved that particular commit.

2The mode is really just one bit (execute or no-execute) for regular files, but git can also store symlinks, sub-trees, and "submodules", so there's actually a little more than just the one bit. For our purposes here, though, we can ignore everything but regular files.


An example

Let's suppose that we create a repository and give it an initial commit with 100 files, each file being different from all other files. To keep things simple we'll put all 100 files in the top level as well (no sub-directories). This initial state of the repository, then, has:

  • one commit object
  • one tree object
  • 100 blobs

plus the usual git overhead (one branch file containing the tip-most SHA-1 ID for master, the HEAD file, and so on). We'll call this repo "hundredfile.git". The 100 files are just "file1.txt" through "file100.txt".

If we count objects in hundredfile.git, there will be 102 of them as per the list above.
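(The real command for this is git count-objects; in our hypothetical repo, before anything has been packed, we'd expect something like:)

$ git count-objects -v
count: 102          # loose objects
in-pack: 0          # nothing packed yet
[remaining size and pack fields omitted]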

Now we'll clone this repository so that we can make 100 commits, or 1 commit, and compare the results. First, let's do the 100 commits. The loop below is pseudo-code, but close enough to really work, provided you have make_change_to set up to make a change to the file (one possible helper is sketched just below). Also, we want each change to produce a new unique file (so that all 100 files always differ from each other), otherwise some of the items in the description below become wrong.
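Here's one possible make_change_to, sketched as a shell function (hypothetical: any command that appends something unique to each file will do):

$ make_change_to() {
>   echo "one more line for $1" >> "$1"  # appending a line naming the file keeps every blob distinct
> }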

$ git clone ssh://host.dom.ain/hundredfile.git method1
[clone messages]
$ cd method1
$ for i in $(jot 100); do  # jot 100 prints 1 2 ... 100 (BSD/macOS; on Linux use: seq 1 100)
>   make_change_to file$i.txt; git add file$i.txt; git commit -m "change $i"
> done
[100 commit results come out here]

Each time we make a new commit, git turns the index (staging area) into a new tree with its new blobs; but we've only modified one file, so 99 of the 100 blobs are actually the same (have the same SHA-1 ID) as last time. Just the one modified file, file$i.txt, has a new and different SHA-1 ID.

Thus, each time we make a new commit, we get one new commit object ("change $i" plus the author-and-committer time stamps, plus the tree), one new "tree" object (to list the 99 identical blob-IDs plus the one new, different blob-ID), and one new "blob" object.

In other words, each commit adds three objects to the repository, and re-uses 99 existing blob objects. We repeat this process 100 times, hence adding 300 objects. 300 + 102 = 402, so this clone in method1 has 402 objects.

Now let's go back to the original hundredfile.git and make a new clone:

$ cd .. # up out of the "method1" repo
$ git clone ssh://host.dom.ain/hundredfile.git method2
[clone messages]
$ cd method2

This time, let's make one single commit after changing (and adding) all 100 files at once:

$ for i in $(jot 100); do
>   make_change_to file$i.txt; git add file$i.txt
> done
$ git commit -m 'change all'
[one commit result comes out here]

Here, all 100 files are different, so git stores one new commit with one new tree with 100 new blob-IDs in it. This repo now has 102+102 = 204 objects, instead of the 402 objects in method1.

This almost certainly takes quite a bit less disk space. The details vary from one system to another, but in general any file takes at least one disk block (typically 512 or 4096 bytes) to store. Since each loose git object is a separate disk file, storing more objects takes more space.
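A crude way to see this is to compare the two clones' .git directories directly (the numbers below are purely illustrative; they depend on the file system and git version):

$ du -sh method1/.git method2/.git
1.9M    method1/.git
1.1M    method2/.git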

But there are several wrinkles.

Git is like the Borg: it tries to add to its collective

Git really likes to hang on to items. When you squash your 100 commits (in method1) into one, what git does is add the one new commit to its repository. This one new commit has your commit message (whatever it is) plus the usual dates and tree-ID and such. It has one tree, which is exactly the same as the final tree of the previous tip commit: that tree stores the name-and-SHA-1 pair for each blob, and every file's blob is exactly the same as before. (That is, the tree's "file1.txt is 1234567..." entry is the same in the new commit as in the original tip-of-branch commit, and this is true for every file, so the tree is the same, so its checksum is the same, so its SHA-1 ID is the same.)

So what you get in method1 is that the 402 objects become 403 objects: the original 402, plus the one new commit, which re-uses the previous tree and all its previous blobs. The repository gets just a bit bigger (probably one disk block for the one file).
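You can confirm the tree re-use yourself with git rev-parse, before and after the squash from the question (tree ID illustrative):

$ git rev-parse HEAD^{tree}      # the tip commit's tree, before squashing
fedcba9...
$ git reset --soft HEAD~100 && git commit -m 'squash last 100 commits'
$ git rev-parse HEAD^{tree}      # the squash commit re-uses the exact same tree
fedcba9...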

Eventually, "unreferenced" objects are garbage-collected

If git never dropped anything, repositories would get seriously bloated, so there is a way for objects to be deleted. This is based on "references", which is a fancy word for "ways to find things". Branches are the most obvious form of reference: the branch reference file contains the SHA-1 ID of the tip of the branch. Tags also count, and "remote branches" and—the key in this particular case—"reflogs".

When you squash the 100 commits into one, the previous tip of your branch (the SHA-1 stored in master in the question above) is saved in two reflogs, one for HEAD and one for the branch. (The ID of the new squash-commit goes into master as usual, of course.)

These reflogs keep the old commits around, but only until the reflog entries expire. By default, the expiration time is set to 30 days (90 days for some cases, but 30 for this one). Once they have expired, git reflog expire will delete them (or you can delete them manually, but this is a little bit tricky).

At this point, the old commits become truly unreferenced: there is no way to find the SHA-1 ID for the previous commit. Now git's garbage collector (part of git gc—and note that git gc also runs git reflog expire for you) can remove the commit, and once it's gone, also the previous commit, and so on back to the first of the 100 commits. Those make the tree objects unreferenced, except for the last tree; and those in turn make the blobs unreferenced, except for the final blobs. (The last tree, and the final blobs, are still find-able through the squash commit you made.)
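If you don't want to wait out the 30 days, you can force the entire sequence by hand (careful: this throws away the safety net the reflogs provide):

$ git reflog expire --expire=now --all    # expire every reflog entry immediately
$ git gc --prune=now                      # then collect the newly unreferenced objects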

So now the repository is actually shrunk down to just the same 204 objects as in repo method2. (They're only the exact same objects if all the commit timestamps are the same, but the number of objects will shrink to 204.)

But there's one more wrinkle that makes all the previous wrinkles mostly irrelevant.

Git packs objects

Besides the "loose" format for objects, .git/objects/12/34567..., git has a "packed" format. Objects that are packed are compressed against other objects in the same pack.

When you make a change to some file, you get two different git blob objects.3 Each object is zlib-compressed, but git doesn't compare it to any other blobs at this point: it's "standalone compressed", as it were. But once two objects are stored in a pack file, they can be "delta-compressed" against each other. The details of the delta format are rather obscure (and not all that important: git is up to pack-file-format number 4, and most people never noticed when it changed the previous times), but the point is that now git actually does store "changes". It's not necessarily "what changed in file1.txt", though: it's possible that git has compressed file39.txt against file75.txt, for instance. It all depends on what's actually in the files and which objects git chooses to compress. It can even compress the other kinds of objects too.
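If you're curious which objects ended up delta-compressed against which, git verify-pack will show you (run git gc first if nothing has been packed yet):

$ git gc
$ git verify-pack -v .git/objects/pack/pack-*.idx | head
[one line per object: SHA-1, type, size, packed size, offset, and, for deltas, the chain depth and base SHA-1]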

As with reflogs and garbage collection, git's packing (or re-packing) is done automatically through git gc, and git invokes gc automatically for you whenever it thinks it's appropriate (see the setting for gc.auto).

You can do manual re-packing, expiration, and object collection if you like, and it's possible to tweak some of the parameters to get better packing sometimes, but that's well beyond the scope of this answer. Usually the automatic result is just fine, and compresses so well that it's not unusual for the .git directory to be smaller than any individual checked-out commit.
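(If you do want to try it, the blunt-instrument version looks like this; it is rarely worth the CPU time, and --aggressive is usually overkill:)

$ git gc --aggressive --prune=now    # re-pack everything from scratch, drop unreferenced objects now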


3More precisely, new files are stored as loose objects; existing objects stored in packs just stay in packs.


The bottom line

To save significant amounts of space, you must drop all references to large files (giant images or gzip'ed tar-balls or whatever) that don't compress well, even with delta-compression in pack files. You can do this with git filter-branch, although it's rather complicated; or you can use the BFG cleaner. See How to remove/delete a large file from commit history in Git repository? for several methods.
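For example, the classic filter-branch invocation for purging one big file from the current branch's history looks like this (giant.tar.gz is a hypothetical file name; the BFG has its own, simpler syntax):

$ git filter-branch --index-filter \
>     'git rm --cached --ignore-unmatch giant.tar.gz' HEAD
[rewrites each commit with giant.tar.gz removed from its tree]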

In general, in my opinion, trying to do this for individual commits is not worth it. Squash a bunch of commits if the result is a more sensible history; don't do it just to save disk space. It might save a bit, but not enough to be worth losing useful history. (On the other hand, losing useless history—history that makes later debugging harder instead of easier—is a good thing even if it makes the repository bigger!)

answered Oct 24 '22 at 15:10 by torek