
If git functions off of snapshots of files, why doesn't .git/ become huge over time?

Tags:

git

I have been reading the Git book. In it, I learned that Git works by taking snapshots of the files you work with, rather than storing deltas the way other VCSs do. This has some excellent benefits.

However, this leaves me wondering: over time, shouldn't the .git/ folder containing these snapshots blow up to be too large? There are repositories with 10,000+ commits and hundreds of files. Why doesn't git blow up in size?

MonkeySeeMonkeyCode asked Aug 16 '18


People also ask

Does GIT store snapshots or diffs?

When you commit, Git stores a snapshot of each entire file; it does not store diffs from the previous commit. As a repository grows, the object count grows quickly, and it becomes inefficient to store the data as loose object files.

How does Git store file changes?

Instead, Git thinks of its data more like a series of snapshots of a miniature filesystem. With Git, every time you commit, or save the state of your project, Git basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot.

How does git Delta work?

When does Git use deltas, and how does this work? Git uses delta compression to store objects efficiently when it creates pack files. This is an implementation detail of the storage format; it's completely immaterial to day-to-day use of Git.

When might git diff head yield no results?

There is no output from git diff because Git sees no changes to tracked files; the new files are untracked, and Git ignores untracked files when generating a diff. I found this one of the key differences from version control systems like SVN (along with staging and ignoring directories).
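
A quick sketch of that behavior, using a hypothetical newfile.txt:

    $ echo hello > newfile.txt
    $ git diff             # no output: newfile.txt is untracked
    $ git add newfile.txt
    $ git diff --cached    # the staged addition now shows up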


1 Answer

The trick here is that this claim:

git functions through taking snapshots of the files you work with, instead of deltas like other VCSs

is both true and false!

Git's main object database—a key-value store—stores four object types. We don't need to go into all the details here; we can just note that files—or more precisely, files' contents—are stored in blob objects. Commit objects then refer (indirectly) to the blob objects, so if you have some file content named bigfile.txt and store it in 1000 different commits, there's only one object in all of those commits, re-used 1000 times. (In fact, if you rename it to hugefile.txt without changing its content, new commits continue to re-use the original object—the name is stored separately, in tree objects.)
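
For example, here is a sketch you can try in any repository (bigfile.txt and hugefile.txt are just the hypothetical names from above): the blob's hash ID is computed from the content alone, so a pure rename creates no new blob object.

    $ git rev-parse HEAD:bigfile.txt    # print the blob's hash ID
    $ git mv bigfile.txt hugefile.txt
    $ git commit -m "rename bigfile.txt to hugefile.txt"
    $ git rev-parse HEAD:hugefile.txt   # same hash ID as before the rename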

That's all fine, but over time, most files in most projects do accumulate changes. Other VCSes will, instead of storing a whole new copy of each file, make use of delta encoding to avoid storing every version of every file separately. If a blob object is a complete, intact (albeit zlib-deflated) file, your question boils down to this: wouldn't the accumulation of separate blob objects make the object database grow much faster than a VCS that uses delta compression?

The answer is that it would, but Git does use delta compression. It just does it below the level of the object database. Objects are logically independent. You give Git the key—the hash ID—for some object, and you get the entire object back. But only so-called loose objects are stored as a simple zlib-deflated file.
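
You can poke at loose objects directly with Git's plumbing commands; in this sketch, <hash> stands for any object ID, such as one printed by git rev-parse above.

    $ find .git/objects -type f   # loose objects (plus any pack files)
    $ git cat-file -t <hash>      # the object's type: blob, tree, commit, or tag
    $ git cat-file -s <hash>      # its full, uncompressed size
    $ git cat-file -p <hash>      # its complete content, pretty-printed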

As Jonathan Brink noted, git gc cleans up unused objects. This does not help with retained objects, such as older versions of hugefile.txt or whatever. But git gc—which Git runs automatically whenever Git thinks it might be appropriate—does more than just prune unreferenced objects. It also runs git repack, which builds or re-builds pack files.
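
You can watch this happen with git count-objects, which reports loose and packed objects separately.

    $ git count-objects -v   # "count" = loose objects, "in-pack" = packed objects
    $ git gc
    $ git count-objects -v   # most loose objects have now moved into a pack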

A pack file stores multiple objects, and inside a pack file, objects are delta-compressed. Git pores over the collection of all objects that will go into a single pack file, and for all N objects, picks some set B of them to use as delta bases. These objects are merely zlib-deflated. The remaining N-B objects are encoded as deltas, against either the bases, or against earlier delta-encoded objects that use those bases. Hence, given a key for an object stored in a pack file, Git can find the stored object or delta, and if what is stored is a delta, Git can also find the underlying objects, all the way down to the delta bases, and hence extract the complete object.
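
You can see these delta chains in an existing pack with git verify-pack: delta-encoded entries are listed with their chain depth and the ID of the object they were deltified against.

    $ git verify-pack -v .git/objects/pack/pack-*.idx
    # per-object columns: ID, type, size, packed size, offset,
    # plus (for deltas only) the chain depth and the base object's ID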

Hence, Git does use delta encoding, but only within a pack file. It's also based not on files but rather on objects, so (at least in theory) if you have huge trees, or long texts inside commit messages, those can be compressed against each other as well.

Even this is not quite the whole story though: for transmission over networks, Git will build so-called thin packs. The key difference between a regular pack and a thin pack has to do with those delta bases. Given a regular pack file and a hash ID, Git can always retrieve the complete object from that file alone. With a thin pack, however, Git is allowed to use objects that are not in that pack file (as long as the other Git, to which the thin-pack is being transported, has claimed that it has those objects). The receiver is required to "fix" the thin pack on receipt, but this allows git fetch and git push to send deltas rather than complete snapshots.
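
The plumbing that fetch and push use is available directly if you want to experiment. In this sketch, <new-tip> and <have-tip> are hypothetical commit IDs, where the receiving repository already has <have-tip> and everything reachable from it.

    $ printf '%s\n' <new-tip> ^<have-tip> |
          git pack-objects --revs --thin --stdout > thin.pack
    # on the receiving side (which must have the delta bases),
    # --fix-thin appends the missing base objects to complete the pack:
    $ git index-pack --fix-thin --stdin < thin.pack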

torek answered Oct 01 '22