When and how does git use deltas for storage?


Git's documentation stresses that Git stores snapshots, not deltas. Since I saw a course on Git claiming that Git stores the differences between versions of files, I tried the following: I initialized a Git repository in an empty folder, created a file lorem.txt containing some lorem ipsum text, staged the file, and committed.

Then, using find .git/objects -type f on the command line, I listed what Git had saved in the objects folder and, as expected, found a commit object pointing to a tree object pointing to a blob object containing the lorem ipsum text I had saved.

Then I modified the lorem ipsum text by adding more content, staged the change, and committed. Listing the files again, I could now see a new commit object pointing to a new tree object, which in turn pointed to a new blob object. Using git cat-file -p 331cf0780688c73be429fa602f9dd99f18b36793 I could see the contents of the new blob. They were exactly the contents of the full lorem.txt file: the old contents plus the change.

This matches the documentation: Git stores snapshots, not deltas. However, searching the internet I found this SO question. The accepted answer says the following:

While that's true and important on the conceptual level, it is NOT true at the storage level.

Git does use deltas for storage.

Not only that, but it's more efficient in it than any other system. Because it does not keep per-file history, when it wants to do delta-compression, it takes each blob, selects some blobs that are likely to be similar (using heuristics that includes the closest approximation of previous version and some others), tries to generate the deltas and picks the smallest one. This way it can (often, depends on the heuristics) take advantage of other similar files or older versions that are more similar than the previous. The "pack window" parameter allows trading performance for delta compression quality. The default (10) generally gives decent results, but when space is limited or to speed up network transfers, git gc --aggressive uses value 250, which makes it run very slow, but provide extra compression for history data.

So that answer says Git does use deltas for storage. As I understand it, Git doesn't use deltas all the time, but only when it detects that it is necessary. Is this true?

I put a lot of lorem text in the file, so that it is 2 MB in size. I thought that after a small change to a big text file Git would automatically use deltas, but as I said, it didn't.

When does Git use deltas, and how does this work?


asked Jan 29 '15 19:01

user1620696

1 Answer

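Git only uses deltas in "packfiles". Initially, each Git object is written as a separate zlib-compressed file, a "loose object" (as you found). Later, Git can pack many objects into a single file, called a packfile. Inside a packfile, an object is stored either whole or as a delta against another, similar object in the pack, and each entry is zlib-compressed. The deltas are what remove the repetition between versions of a file; zlib takes care of repetition inside a single object.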
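To make the contrast concrete, here is a minimal Python sketch of the two storage styles. It assumes the default SHA-1 object format, where a blob is hashed and zlib-compressed as "blob <size>\0" followed by its content. The delta part is only a toy: it uses difflib to produce copy/insert instructions in the spirit of a packfile delta, and it is not Git's actual delta algorithm or binary encoding. The names loose_object, make_delta and apply_delta are made up for this illustration.

```python
import hashlib
import zlib
from difflib import SequenceMatcher


def loose_object(content: bytes):
    """Return (object id, compressed bytes) for a blob stored as a loose object."""
    data = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    return hashlib.sha1(data).hexdigest(), zlib.compress(data)


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions (toy version)."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))        # reuse bytes already in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))    # carry new bytes literally
        # "delete" needs no instruction: those base bytes are simply not copied
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base object and the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, size = op
            out += base[offset:offset + size]
        else:
            out += op[1]
    return bytes(out)


v1 = b"".join(b"lorem ipsum line %d\n" % n for n in range(40))
v2 = v1 + b"one more line\n"

id1, loose1 = loose_object(v1)
id2, loose2 = loose_object(v2)
delta = make_delta(v1, v2)
literal = sum(len(op[1]) for op in delta if op[0] == "insert")

print(id1 != id2)                 # each version is its own full blob object
print(len(loose1), len(loose2))   # both loose objects are compressed independently
print(len(v2), literal)           # full size of v2 vs. literal bytes the delta carries
print(apply_delta(v1, delta) == v2)
```

As loose objects, both versions are stored in full. With a delta, the second version only needs a reference to the first plus the bytes that are actually new.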

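This packing is performed by git repack. You can see it in action by invoking it manually. If you run git repack -ad on a Git repo, you should see the used disk space and the number of files under .git/objects drop, as the loose objects are combined into a pack and deltified. To check which objects ended up as deltas, run git verify-pack -v on the pack's .idx file; deltified entries are listed with their delta depth and base object.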
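The experiment from the question can be repeated in a throwaway repository. The following is a rough Python sketch that shells out to git (it assumes git is installed and on the PATH). It commits two versions of a file, counts loose objects and packfiles, runs git repack -a -d, and counts again. The helper names git, count_objects and repack_demo, and the demo identity passed with -c, are made up for this illustration.

```python
import pathlib
import subprocess
import tempfile


def git(repo, *args):
    """Run a git command inside repo and return its standard output."""
    cmd = ["git", "-C", str(repo),
           "-c", "user.name=demo", "-c", "user.email=demo@example.com",
           "-c", "commit.gpgsign=false", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def count_objects(repo):
    """Return (loose object files, packfiles) found under .git/objects."""
    objects = pathlib.Path(repo) / ".git" / "objects"
    loose = sum(1 for p in objects.glob("[0-9a-f][0-9a-f]/*") if p.is_file())
    packs = len(list((objects / "pack").glob("*.pack")))
    return loose, packs


def repack_demo():
    """Commit two versions of lorem.txt, then compare counts before and after repack."""
    with tempfile.TemporaryDirectory() as repo:
        subprocess.run(["git", "init", "-q", repo], check=True)
        lorem = pathlib.Path(repo) / "lorem.txt"
        for i in range(2):
            lines = 2000 + i * 10  # the second version only appends a few lines
            lorem.write_text("".join(f"lorem ipsum line {n}\n" for n in range(lines)))
            git(repo, "add", "lorem.txt")
            git(repo, "commit", "-q", "-m", f"version {i + 1}")
        before = count_objects(repo)
        git(repo, "repack", "-a", "-d", "-q")
        after = count_objects(repo)
    return before, after


if __name__ == "__main__":
    before, after = repack_demo()
    print("loose objects, packs before repack:", before)
    print("loose objects, packs after repack: ", after)
```

Before the repack, every commit, tree and blob should show up as its own loose file. Afterwards the loose files should be gone and a single packfile should remain. Whether the second blob is actually stored as a delta depends on Git's heuristics, which is what git verify-pack -v lets you inspect.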

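In practice, you don't usually need to run git repack yourself. Several everyday commands run git gc --auto, which in turn runs git repack once enough loose objects have accumulated (the threshold is the gc.auto setting). That is why a repository with only a couple of commits still shows full, undeltified blobs. So relax, git has your back :-).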

The excellent "git book" also has a chapter on packfiles with more explanations: http://git-scm.com/book/en/v2/Git-Internals-Packfiles .


answered Oct 06 '22 05:10

sleske
