
If git functions off of snapshots of files, why doesn't .git/ become huge over time?

Tags:

git

I have been reading the Git book. In it, I learned that Git works by taking snapshots of the files you work with, rather than storing deltas the way other VCSs do. This has some excellent benefits.

However, this leaves me wondering: over time, shouldn't the .git/ folder containing these snapshots blow up to be too large? There are repositories with 10,000+ commits and hundreds of files. Why doesn't git blow up in size?

MonkeySeeMonkeyCode asked Aug 16 '18


People also ask

Does GIT store snapshots or diffs?

When you commit, Git stores a snapshot of each entire file; it does not store diffs from the previous commit. As a repository grows, the object count grows quickly, and it becomes inefficient to store the data as loose object files.

How does Git store file changes?

Instead, Git thinks of its data more like a series of snapshots of a miniature filesystem. With Git, every time you commit, or save the state of your project, Git basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot.

How does git Delta work?

When does Git use deltas, and how does this work? Git uses delta compression to store objects efficiently when it creates pack files. This is an implementation detail of the storage format; it's completely immaterial to day-to-day use of Git.

When might git diff head yield no results?

There is no output from git diff because Git sees no changes to tracked files; the new files are untracked, and Git ignores untracked files when generating a diff. I found this one of the key differences from version control systems like SVN (along with staging and ignoring directories).
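
A quick sketch of that behavior, using a hypothetical newfile.txt:

    $ echo hello > newfile.txt
    $ git diff             # no output: newfile.txt is untracked
    $ git add newfile.txt
    $ git diff --cached    # the staged addition now shows up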


1 Answer

The trick here is that this claim:

git functions through taking snapshots of the files you work with, instead of deltas like other VCSs

is both true and false!

Git's main object database—a key-value store—stores four object types. We don't need to go into all the details here; we can just note that files—or more precisely, files' contents—are stored in blob objects. Commit objects then refer (indirectly) to the blob objects, so if you have some file content named bigfile.txt and store it in 1000 different commits, there's only one object in all of those commits, re-used 1000 times. (In fact, if you rename it to hugefile.txt without changing its content, new commits continue to re-use the original object—the name is stored separately, in tree objects.)
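
For example, here is a sketch you can try in any repository (bigfile.txt and hugefile.txt are just the hypothetical names from above): the blob's hash ID is computed from the content alone, so a pure rename creates no new blob object.

    $ git rev-parse HEAD:bigfile.txt    # print the blob's hash ID
    $ git mv bigfile.txt hugefile.txt
    $ git commit -m "rename bigfile.txt to hugefile.txt"
    $ git rev-parse HEAD:hugefile.txt   # same hash ID as before the rename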

That's all fine, but over time, most files in most projects do accumulate changes. Other VCSes will, instead of storing a whole new copy of each file, make use of delta encoding to avoid storing every version of every file separately. If a blob object is a complete, intact (albeit zlib-deflated) file, your question boils down to this: wouldn't the accumulation of separate blob objects make the object database grow much faster than a VCS that uses delta compression?

The answer is that it would, but Git does use delta compression. It just does it below the level of the object database. Objects are logically independent. You give Git the key—the hash ID—for some object, and you get the entire object back. But only so-called loose objects are stored as a simple zlib-deflated file.
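
You can poke at loose objects directly with Git's plumbing commands; in this sketch, <hash> stands for any object ID, such as one printed by git rev-parse above.

    $ find .git/objects -type f   # loose objects (plus any pack files)
    $ git cat-file -t <hash>      # the object's type: blob, tree, commit, or tag
    $ git cat-file -s <hash>      # its full, uncompressed size
    $ git cat-file -p <hash>      # its complete content, pretty-printed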

As Jonathan Brink noted, git gc cleans up unused objects. This does not help with retained objects, such as older versions of hugefile.txt or whatever. But git gc—which Git runs automatically whenever Git thinks it might be appropriate—does more than just prune unreferenced objects. It also runs git repack, which builds or re-builds pack files.
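
You can watch this happen with git count-objects, which reports loose and packed objects separately.

    $ git count-objects -v   # "count" = loose objects, "in-pack" = packed objects
    $ git gc
    $ git count-objects -v   # most loose objects have now moved into a pack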

A pack file stores multiple objects, and inside a pack file, objects are delta-compressed. Git pores over the collection of all objects that will go into a single pack file, and for all N objects, picks some set B of them to use as delta bases. These objects are merely zlib-deflated. The remaining N-B objects are encoded as deltas, against either the bases, or against earlier delta-encoded objects that use those bases. Hence, given a key for an object stored in a pack file, Git can find the stored object or delta, and if what is stored is a delta, Git can also find the underlying objects, all the way down to the delta bases, and hence extract the complete object.
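
You can see these delta chains in an existing pack with git verify-pack: delta-encoded entries are listed with their chain depth and the ID of the object they were deltified against.

    $ git verify-pack -v .git/objects/pack/pack-*.idx
    # per-object columns: ID, type, size, packed size, offset,
    # plus (for deltas only) the chain depth and the base object's ID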

Hence, Git does use delta encoding, but only within a pack file. It's also based not on files but rather on objects, so (at least in theory) if you have huge trees, or long texts inside commit messages, those can be compressed against each other as well.

Even this is not quite the whole story though: for transmission over networks, Git will build so-called thin packs. The key difference between a regular pack and a thin pack has to do with those delta bases. Given a regular pack file and a hash ID, Git can always retrieve the complete object from that file alone. With a thin pack, however, Git is allowed to use objects that are not in that pack file (as long as the other Git, to which the thin-pack is being transported, has claimed that it has those objects). The receiver is required to "fix" the thin pack on receipt, but this allows git fetch and git push to send deltas rather than complete snapshots.
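
The plumbing that fetch and push use is available directly if you want to experiment. In this sketch, <new-tip> and <have-tip> are hypothetical commit IDs, where the receiving repository already has <have-tip> and everything reachable from it.

    $ printf '%s\n' <new-tip> ^<have-tip> |
          git pack-objects --revs --thin --stdout > thin.pack
    # on the receiving side (which must have the delta bases),
    # --fix-thin appends the missing base objects to complete the pack:
    $ git index-pack --fix-thin --stdin < thin.pack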

torek answered Oct 01 '22