Git internals: how does Git store small differences between revisions?

Tags:

git

As I understand it, some VCSs store differences between revisions because the differences are often small: one line of source code is changed, or a comment is added, in a subsequent revision. Git, on the other hand, stores compressed "snapshots" of each revision.

If only a small change has been made (one line in a large text file), how does Git treat this? Does it store two copies that are almost identical? This would be an inefficient use of space, I'd think.

asked Apr 12 '17 03:04

flow2k

1 Answer

Does it store two copies that are almost identical? This would be an inefficient use of space, I'd think.

Yes, Git does exactly this, at least at first. When you stage and commit a change, Git stores a zlib-compressed copy of each new or changed file under the .git/objects/ tree, in a file named after the SHA-1 hash of its contents plus a short header (these are called "loose" objects). Files that did not change are not copied again; the new snapshot simply refers to the objects already stored. You can go look at these files, and it's worthwhile to do so if you are curious about the format.
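To make the naming scheme concrete, here is a minimal Python sketch of how a loose blob could be built by hand. It assumes a repository using the default SHA-1 object format; the function names encode_loose_blob and loose_object_path are made up for this illustration and are not part of Git.

```python
import hashlib
import zlib
from pathlib import Path


def encode_loose_blob(content: bytes) -> tuple[str, bytes]:
    """Return (object ID, compressed bytes) for a blob holding `content`."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) followed by the data.
    store = b"blob %d\0" % len(content) + content
    oid = hashlib.sha1(store).hexdigest()
    # The file on disk is the same header + data, compressed with zlib.
    return oid, zlib.compress(store)


def loose_object_path(git_dir: Path, oid: str) -> Path:
    """Return where a loose object with this ID would live under `git_dir`."""
    # The first two hex digits name a directory, the remaining 38 name the file.
    return git_dir / "objects" / oid[:2] / oid[2:]


oid, compressed = encode_loose_blob(b"test content\n")
print(oid, loose_object_path(Path(".git"), oid))
```

Because the name depends only on the content, two files with identical content always map to the same object, which is why unchanged files cost nothing extra in a new commit.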

The point to remember is that Git is built for speed, and doesn't care very much about the size of the repository data. When Git wants to look at an old revision of a file that is stored as a loose object, all it has to do is read the file as-is from the .git/objects/ tree. No application of deltas, just reading raw bytes with zlib decompression (which is very fast).
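Reading a loose object back is just as simple. The sketch below assumes the file layout described above (a "type size" header, a NUL byte, then the data); decode_loose_object is a hypothetical helper name, not a Git command.

```python
import zlib


def decode_loose_object(raw: bytes) -> tuple[str, bytes]:
    """Decode the bytes of a loose object file into (type, content)."""
    store = zlib.decompress(raw)
    # Everything before the first NUL byte is the header, e.g. b"blob 13".
    header, _, content = store.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), content
```

To try it on a real repository, pass it the bytes of any file under .git/objects/ whose parent directory has a two-character name, for example decode_loose_object(path.read_bytes()).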

Now, you would be correct to observe that after you use a repository for a while, the files in .git/objects/ would contain a great many copies of your source files, all just a little bit different. That's where "pack" files come in. When a pack file is created (Git does this automatically from time to time, or you can run git gc yourself), Git collects all the file objects together, sorts them in a way that will compress well, and compresses them into a pack file using a number of different techniques.
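If you want to watch this happen, you can count loose objects and pack files before and after running git gc. The following is a rough sketch that only inspects the directory layout; count_objects is a name invented here, and the built-in git count-objects -v command reports similar numbers.

```python
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")


def count_objects(git_dir: Path) -> dict[str, int]:
    """Count loose objects and pack files under `git_dir`/objects."""
    objects = git_dir / "objects"
    loose = 0
    for entry in objects.iterdir():
        # Loose objects live in directories named by two hex digits;
        # "pack" and "info" are skipped by this check.
        if entry.is_dir() and len(entry.name) == 2 and set(entry.name) <= HEX_DIGITS:
            loose += sum(1 for f in entry.iterdir() if f.is_file())
    pack_dir = objects / "pack"
    packs = len(list(pack_dir.glob("*.pack"))) if pack_dir.is_dir() else 0
    return {"loose": loose, "packs": packs}
```

In a typical repository, the "loose" count drops sharply after a git gc while "packs" stays small.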

One of the techniques used when creating pack files is indeed delta compression. Git will notice that two objects look very similar, and store one of the objects in full plus a delta that rebuilds the other from it. Note that this is done purely on a per-object basis, treating the objects as raw data, without regard to the order in which things were committed or how your branches are arranged. The low-level pack file format is an implementation detail as far as the rest of Git is concerned.
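The idea behind a delta can be shown with a toy example. The sketch below is not Git's binary delta encoding; it is a simplified, line-based stand-in built on Python's difflib, loosely modeled on the same two kinds of instruction (copy a range from the base, or insert new data). The names make_delta and apply_delta are hypothetical.

```python
import difflib


def make_delta(base: bytes, target: bytes) -> list[tuple]:
    """Describe `target` as copy/insert instructions relative to `base`."""
    base_lines = base.splitlines(keepends=True)
    target_lines = target.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(None, base_lines, target_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged lines: just reference their position in the base.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New or changed lines: store the new data literally.
            delta.append(("insert", b"".join(target_lines[j1:j2])))
        # "delete" needs no instruction: those base lines are simply not copied.
    return delta


def apply_delta(base: bytes, delta: list[tuple]) -> bytes:
    """Rebuild the target object from `base` and a delta from make_delta."""
    base_lines = base.splitlines(keepends=True)
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(base_lines[op[1]:op[2]])
        else:
            out.append(op[1])
    return b"".join(out)
```

For a large file with one changed line, the delta holds a couple of copy instructions and the one new line, so storing the base plus the delta takes far less space than storing both versions in full.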

Remember, Git is still built for speed, so pack files are not necessarily the absolute best compression you can possibly get. There are a lot of heuristics in pack file creation related to tradeoffs between speed and size.

When Git wants to read an object and it's not a "loose" object, it will look in the pack files (which are in .git/objects/pack/) to see if it can be found there. When Git finds the right pack file, it extracts the object from the pack file, applying whatever steps (delta resolution, decompression, etc.) are needed to reconstruct the original file object. The higher-level parts of Git do not care how the pack file stores the data, which is a good separation of concerns and simplifies the application code.
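The lookup order can be sketched in a few lines of Python. This assumes a SHA-1 repository and the version-2 pack index (.idx) layout as I understand it from Git's pack-format documentation: a 4-byte magic, a 4-byte version, a 256-entry fanout table, then the sorted 20-byte object IDs. The function names are invented for this example, and real Git handles more cases (older index versions, multi-pack indexes, alternate object stores).

```python
import bisect
import struct
from pathlib import Path

IDX_MAGIC = b"\xfftOc"


def idx_object_ids(idx_bytes: bytes) -> list[bytes]:
    """Return the sorted 20-byte object IDs listed in a version-2 pack index."""
    magic, version = struct.unpack(">4sI", idx_bytes[:8])
    if magic != IDX_MAGIC or version != 2:
        raise ValueError("not a version-2 pack index")
    fanout = struct.unpack(">256I", idx_bytes[8:8 + 1024])
    count = fanout[255]  # the last fanout entry is the total number of objects
    start = 8 + 1024
    return [idx_bytes[start + 20 * i:start + 20 * (i + 1)] for i in range(count)]


def locate_object(git_dir: Path, oid: str) -> str:
    """Report whether an object is loose, packed, or missing."""
    # 1. Try the loose object file first.
    if (git_dir / "objects" / oid[:2] / oid[2:]).is_file():
        return "loose"
    # 2. Otherwise search each pack index for the object ID.
    raw = bytes.fromhex(oid)
    pack_dir = git_dir / "objects" / "pack"
    if pack_dir.is_dir():
        for idx in sorted(pack_dir.glob("*.idx")):
            ids = idx_object_ids(idx.read_bytes())
            pos = bisect.bisect_left(ids, raw)
            if pos < len(ids) and ids[pos] == raw:
                return f"packed in {idx.with_suffix('.pack').name}"
    return "missing"
```

This only finds which pack holds the object. Extracting it would additionally mean reading the offset from the index, decompressing the entry in the .pack file, and resolving any delta chain.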

If you want to learn more about this, I suggest reading the Pro Git book, specifically the sections

10.2 Git Internals - Git Objects
10.4 Git Internals - Packfiles

answered Oct 06 '22 04:10

Greg Hewgill
