How does Git store large files over many commits?

Tags:

git

I have been using git for a while now and am gradually understanding how it works. One main point I understood is that it creates a snapshot every time a new commit is made. Of course, the snapshot contains only the changed files plus pointers to the unchanged files.

According to Pro Git § 1.3 Getting Started - Git Basics:

Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again—just a link to the previous identical file it has already stored.

But let's say I have a really big file, e.g. a 2 GB text file, and I change that file 10 times and hence make 10 commits in a day. Does that mean I now have ten 2 GB files on my computer? That seems really inefficient to me, so I believe this might not be the case.

Could someone clarify what would happen in this scenario?

asked May 02 '14 06:05

RandomQuestion

2 Answers

The short answer is "yes, you now have 10 2GB files". However:

"Files" under a commit are stored as "blob" objects, and all git objects (blobs, trees, commits, and annotated tags) are kept internally in zlib-deflated format. So a 2 GB text file is actually stored as a considerably smaller object.
"Loose" objects (again, of every type) are eventually "packed". You can do this manually with git pack-objects and git repack, but generally you just let git do it on its own as part of standard "garbage collection" (git gc). Inside a pack, objects are delta-compressed against similar objects, so ten near-identical versions of a file cost far less than ten full copies. The end result with most files is pretty impressive.

All that said, git eventually fails badly if you feed it a lot of large incompressible binary files (I had to deal with this at a previous workplace, where we stuffed 2GB of .tgz files into repos). They don't deflate, they generally don't delta-compress, and eventually even the pack format falls over. There are at least two solutions in relatively widespread use: git-annex and git-bup. See Managing large binary files with git.
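
To see why already-compressed files gain nothing from the deflate step, here is a rough sketch that runs Python's zlib (the same compression family git uses for its objects) over repetitive text and over random bytes. The sample log line and the use of os.urandom as a stand-in for a .tgz payload are assumptions made for illustration, not data from the repos mentioned above.

```python
import os
import zlib


def deflate_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# Hypothetical, highly repetitive text, roughly 1 MB.
text = b"2014-05-02 06:05:00 INFO request handled in 12 ms\n" * 20_000

# Random bytes stand in for data that is already compressed, such as a .tgz.
noise = os.urandom(len(text))

print(f"text:  {deflate_ratio(text):.4f}")
print(f"noise: {deflate_ratio(noise):.4f}")
```

The text shrinks to a small fraction of its size, while the random bytes do not shrink at all. Every revision of such a file therefore costs roughly its full size in the repository.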

answered Oct 22 '22 14:10

torek

I just tested it.

First I created a large file (24 MB of text) and committed it. My .git directory was then 216 KB. Git compresses its objects, and my text file was easy to compress.

I then made a small change on the first line of the file and committed that. My .git directory was then 356 KB. .git/objects now contained two objects, both 132 KB.
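
The compression seen here can be sketched in a few lines of Python. A loose blob is the header "blob <size>" plus a NUL byte, followed by the file content; the object id is the SHA-1 of that (with the default SHA-1 object format), and the bytes written under .git/objects are the zlib-deflated form. The function name git_blob and the sample text are made up for this sketch, and the exact compressed size will differ from what git produces because git's compression level may not match zlib's default.

```python
import hashlib
import zlib


def git_blob(content: bytes) -> tuple:
    """Return (object id, deflated bytes) for content stored as a loose blob."""
    # Header "blob <size>\0" followed by the raw file content.
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    # The object id is the SHA-1 of header plus content.
    oid = hashlib.sha1(store).hexdigest()
    # The loose object on disk is the zlib-deflated header plus content.
    return oid, zlib.compress(store)


if __name__ == "__main__":
    # Hypothetical repetitive text, a few megabytes in size.
    sample = b"the same line of text, over and over\n" * 100_000
    oid, deflated = git_blob(sample)
    print(oid, len(sample), "->", len(deflated), "bytes")
```

The id should match what git hash-object prints for the same bytes, which is one way to check the sketch against a real repository.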

132K    ./.git/objects/8d
132K    ./.git/objects/f7

After running git gc, those two objects are compressed into a single pack file of only 68 KB.

So at least under some circumstances git will keep entire (compressed) copies of large files for a while, until they are packed.
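
The difference between the loose and packed sizes can be illustrated with a toy model. The sketch below is not git's pack format (real packs encode deltas as copy and insert instructions against a base object chosen by heuristics); it only trims the shared prefix and suffix of two versions, which is enough to show why a one-line change costs almost nothing once one version is stored as a delta of the other. All names and the generated text are hypothetical.

```python
import random
import zlib


def toy_delta(base: bytes, target: bytes) -> tuple:
    """Describe target as (shared prefix length, shared suffix length, middle)."""
    max_len = min(len(base), len(target))
    prefix = 0
    while prefix < max_len and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    # Stop before the suffix would overlap the prefix already matched.
    while suffix < max_len - prefix and base[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    return prefix, suffix, target[prefix:len(target) - suffix]


def apply_toy_delta(base: bytes, delta: tuple) -> bytes:
    """Rebuild the target from the base and a toy delta."""
    prefix, suffix, middle = delta
    return base[:prefix] + middle + base[len(base) - suffix:]


# Build a made-up text file, then change only its first line.
rng = random.Random(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta"]
lines = [b" ".join(rng.choice(words) for _ in range(8)) for _ in range(20_000)]
v1 = b"\n".join(lines) + b"\n"
v2 = b"a small change on the first line\n" + v1[v1.index(b"\n") + 1:]

delta = toy_delta(v1, v2)
assert apply_toy_delta(v1, delta) == v2

# Two loose objects: each version deflated on its own.
loose = len(zlib.compress(v1)) + len(zlib.compress(v2))
# Toy "pack": one full version plus the deflated changed bytes.
packed = len(zlib.compress(v1)) + len(zlib.compress(delta[2]))
print("loose:", loose, "bytes; toy packed:", packed, "bytes")
```

In this model the packed total is close to the size of a single compressed version, which matches the direction of the numbers measured above, though not their exact values.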

answered Oct 22 '22 14:10

Andreas Wederbrand
