
How does Git store large files over many commits?

Tags: git

I have been using Git for a while now and am gradually understanding how it works. One main point I have understood is that it creates a snapshot every time a new commit is made. Of course, the snapshot contains only the changed files, plus pointers to the unchanged files.

According to Pro Git § 1.3 Getting Started - Git Basics:

Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again—just a link to the previous identical file it has already stored.

But let's say I have a really big file, e.g. a 2GB text file. If I change that file 10 times and make 10 commits in a day, does that mean I now have ten 2GB files on my computer? That seems really inefficient to me, so I suspect this is not the case.

Could someone clarify what would happen in this scenario?

RandomQuestion asked May 02 '14 06:05



2 Answers

The short answer is "yes, you now have ten 2GB files": each commit refers to a complete copy of the file as it was at that point. However:

  1. "Files" under a commit are stored as "blob" objects, and all git objects (blobs, trees, commits, and annotated tags) are kept internally in zlib-deflated format. So a 2 GB text file is stored on disk as a considerably smaller object.

  2. "Loose" objects (all of them, again) are eventually "packed". You can do this manually with git pack-objects and git repack, but generally you just let git do it on its own as part of standard "garbage collection" (git gc). Inside a pack, objects are delta-compressed against similar objects, so ten nearly identical versions of a file take little more room than one. The end result with most files is pretty impressive.

All that said, git eventually fails badly if you feed it a lot of large, incompressible binary files (I had to deal with this at a previous workplace, where we stuffed 2GB of .tgz files into repos). They don't deflate, they generally don't delta-compress, and eventually even the pack format falls over. There are at least two solutions in relatively widespread use: git-annex and git-bup. See Managing large binary files with git.
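The difference between text and already-compressed data is easy to see with zlib directly. The sketch below is only an illustration: 1 MB buffers stand in for the 2 GB files, a repeated log line (invented here) stands in for a text file, and random bytes stand in for .tgz content, which looks statistically similar to a compressor.

```python
import os
import zlib

SIZE = 1_000_000  # 1 MB stands in for the 2 GB files discussed above

line = b"2014-05-02 06:05:00 INFO request handled in 12 ms\n"
text = (line * (SIZE // len(line) + 1))[:SIZE]  # highly repetitive text
noise = os.urandom(SIZE)  # stand-in for already-compressed data such as .tgz


def deflate_ratio(data: bytes) -> float:
    """Deflated size divided by original size (lower means better compression)."""
    return len(zlib.compress(data)) / len(data)


print(f"repetitive text: {deflate_ratio(text):.3f}")
print(f"random bytes:    {deflate_ratio(noise):.3f}")
```

Real text files are less repetitive than this sample, so they land somewhere between the two extremes, but already-compressed archives stay close to a ratio of 1.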

torek answered Oct 22 '22 14:10


I just tested it.

First I created a large file (24 MB of text) and committed it. My .git directory was then 216 KB. git uses compression, and my text file was easy to compress.

I then made a small change on the first line of the file and committed that. My .git directory grew to 356 KB. .git/objects now contained two blob directories, each 132 KB:

132K    ./.git/objects/8d
132K    ./.git/objects/f7

After running git gc, those two objects were packed into a single pack file of only 68 KB.

So at least under some circumstances git will keep entire (compressed) copies of large files for a while, until they are packed.
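The effect of packing can be imitated in a few lines of Python. This is a toy sketch, not git's actual delta format or its heuristics for choosing a base object: the functions toy_delta and apply_toy_delta are invented here, and the delta simply records how many leading and trailing bytes two versions share plus the bytes in between.

```python
import zlib
from typing import Tuple


def toy_delta(base: bytes, target: bytes) -> Tuple[int, int, bytes]:
    """Describe target as (shared prefix length, shared suffix length, middle bytes)."""
    limit = min(len(base), len(target))
    prefix = 0
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and base[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    return prefix, suffix, target[prefix:len(target) - suffix]


def apply_toy_delta(base: bytes, delta: Tuple[int, int, bytes]) -> bytes:
    """Rebuild the target from the base and a toy delta."""
    prefix, suffix, middle = delta
    return base[:prefix] + middle + base[len(base) - suffix:]


# Two versions of a text file; only the first line differs.
v1 = b"".join(b"row %d: value=%d\n" % (i, (i * i) % 9973) for i in range(60_000))
v2 = b"row 0: value=CHANGED\n" + v1[v1.index(b"\n") + 1:]

delta = toy_delta(v1, v2)
print("v1 deflated on its own:", len(zlib.compress(v1)), "bytes")
print("v2 deflated on its own:", len(zlib.compress(v2)), "bytes")
print("v2 as a toy delta against v1:", len(delta[2]), "bytes of new data")
```

Stored separately, both versions cost roughly the same amount of space; stored as one full copy plus a delta, the second version costs almost nothing, which is the kind of saving the pack file in the test above reflects.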

Andreas Wederbrand answered Oct 22 '22 14:10