As I understand it, some VCSs store differences between revisions, because, well, the differences are often small: one line in a source file is changed, or a comment is added in a subsequent revision. Git, on the other hand, stores compressed "snapshots" for each revision.
If only a small change has been made (one line in a large text file), how does Git treat this? Does it store two copies that are almost identical? This would be an inefficient use of space, I'd think.
Git stores every single version of each file it tracks as a blob. Git identifies blobs by the hash of their content and keeps them in .git/objects. Any change to the file content will generate a completely new blob object.
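For example, here is a quick sketch you can run inside any repository (the content string is arbitrary); it asks Git for the hash it will assign to some content, writes the blob, and reads it back:

    $ echo 'test content' | git hash-object -w --stdin   # -w writes the blob into .git/objects
    d670460b4b4aece5915caf5c68d12f560a9fe3e4
    $ git cat-file -t d670460                            # report the object's type
    blob
    $ git cat-file -p d670460                            # print the stored content
    test content

Because the name is derived purely from the content, changing even one character produces a blob with a completely different hash.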
Git stores the complete history of your files for a project in a special directory (a.k.a. a folder) called a repository, or repo. This repo is usually in a hidden folder called .git sitting next to your files.
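A minimal illustration (the directory name is made up, and the exact contents of .git vary a little between Git versions):

    $ git init demo
    $ ls -a demo
    .  ..  .git
    $ ls demo/.git        # HEAD, config, objects, refs, and a few other housekeeping entries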
The super-short version is that git status runs git diff. In fact, it runs it twice - or, more precisely, it runs two different internal variations on git diff: one to compare HEAD to the index/staging-area, and one to compare the staging-area to the work-tree.
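You can run the same two comparisons by hand; a rough sketch:

    $ git diff --cached     # HEAD vs. index/staging-area (what you have staged)
    $ git diff              # index/staging-area vs. work-tree (what you have not staged)
    $ git status            # summarizes both comparisons at once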
When you commit, Git stores a snapshot of each file in its entirety; it does not store a diff from the previous commit. As a repository grows, the object count grows quickly, and it becomes inefficient to store all of that data as loose object files.
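You can watch this happen with git count-objects, which reports how many objects are still loose versus already packed:

    $ git count-objects -v    # "count" = loose objects, "size" = their disk usage in KiB,
                              # "in-pack" = objects already stored in pack files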
Does it store two copies that are almost identical? This would be an inefficient use of space, I'd think.
Yes, Git does exactly this, at least at first. When you make a commit, Git makes a (slightly compressed) copy of your source files under the .git/objects/ tree, with a name based on the SHA-1 of the contents (these are called "loose" objects). You can go look at these files, and it's worthwhile to do so if you are curious about the format.
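For instance (the particular hash is just whichever object you pick from the listing); the first two hex digits of the hash become a directory name and the rest becomes the file name:

    $ find .git/objects -type f
    .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
    ...
    $ git cat-file -p d670460    # the loose file itself is zlib-compressed, so use
                                 # cat-file rather than plain cat to see its content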
The point to remember is that Git is built for speed, and doesn't care very much about the size of the repository data. When Git wants to look at an old revision, all it has to do is read the file as-is from the .git/objects/ tree. There is no application of deltas, just raw byte reading plus zlib decompression (which is very fast).
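For example, pulling a single file out of an old revision is just a direct object read (the revision and path here are placeholders):

    $ git show HEAD~20:src/main.c          # the file exactly as it was 20 commits ago
    $ git cat-file -p HEAD~20:src/main.c   # same thing via the lower-level plumbing command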
Now, you would be correct to observe that after you use a repository for a while, the .git/objects/ tree would contain a great many copies of your source files, all just a little bit different. That's where "pack" files come in. When you create a pack file (either automatically or manually), Git collects all the file objects together, sorts them in a way that will compress well, and compresses them into a pack file using a number of different techniques.
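Packing normally happens behind the scenes via git gc, but you can trigger it yourself; a rough sketch:

    $ git gc                     # or, more directly: git repack -a -d
    $ ls .git/objects/pack/      # the resulting pack-*.pack (data) and pack-*.idx (index) files
    $ git count-objects -v       # most objects should now show up under "in-pack"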
One of the techniques used when creating pack files is, indeed, delta compression. Git will notice that two objects look very similar, and store one of the objects plus a delta difference between them. Note that this is done purely on an object basis, as raw data, without regard to the order in which things were committed or how your branches are arranged. The low-level pack file format is an implementation detail as far as the rest of Git is concerned.
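You can see the deltas for yourself with git verify-pack (the pack file name is whatever git gc produced):

    $ git verify-pack -v .git/objects/pack/pack-*.idx | head
    # each line lists an object's hash, type, size, and size inside the pack; delta-compressed
    # entries additionally show their delta depth and the object they were delta'd against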
Remember, Git is still built for speed, so pack files are not necessarily the absolute best compression you can possibly get. There are a lot of heuristics in pack file creation related to tradeoffs between speed and size.
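Those tradeoffs are tunable to a degree; for example (the values here are arbitrary, see git-repack(1) and git-config(1) for details):

    $ git repack -a -d -f --window=250 --depth=50
          # -f recomputes deltas from scratch; a larger --window makes Git compare more
          # candidate objects (slower, usually smaller), and --depth caps the delta chain length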
When Git wants to read an object and it's not a "loose" object, it will look in the pack files (which live in .git/objects/pack/) to see if it can be found there. When Git finds the right pack file, it extracts the object from it, applying whatever steps are needed (delta resolution, decompression, and so on) to reconstruct the original file object. The higher-level parts of Git do not care how the pack file stores the data, which is a good separation of concerns and simplifies the application code.
If you want to learn more about this, I suggest reading the Pro Git book, specifically its sections on Git internals ("Git Objects" and "Packfiles").