How does git store duplicate files?

Tags:

git

We have a Git repository that contains SVM AI input data and results. Every time we run a new model, we create a new root folder for that model so that we can organize our results over time:

/run1.0
  /data
    ... 100 MB of data
  /classification.csv
  /results.csv
  ...
/run2.0
  /data
    ... 200 MB of data (including run1.0/data)
  /classification.csv
  /results.csv
  ...

As we build new models, we may pull in data (large .wav files) from a previous run. This means that run2.0/data may contain all the files from run1.0/data, plus any additional data we have collected.

The repo will easily exceed a gigabyte if we keep this up.

Does Git have a way to recognize duplicate binary files and store them only once (e.g. like a symlink)? If not, we will rework how the data is stored.

388

asked Apr 29 '15 15:04

JoshuaJ

1 Answer

I am probably not going to explain this quite right, but my understanding is that every commit stores only a tree structure representing your project's file layout, with pointers to the actual file contents, which are stored as objects in the .git/objects subfolder. Git names each object after a SHA-1 hash computed from the file's contents (strictly, the contents prefixed with a short header giving the object type and size). So, for example, if a file's contents produced the following hash:

0b064b56112cc80495ba59e2ef63ffc9e9ef0c77

It would be stored as:

.git/objects/0b/064b56112cc80495ba59e2ef63ffc9e9ef0c77

The first two characters are used as a directory name and the rest as the file name.
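To make the naming scheme concrete, here is a minimal Python sketch of how a blob ID and its loose-object path can be derived. It assumes a repository using the default SHA-1 object format; the function names `git_blob_id` and `loose_object_path` are made up for illustration and are not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 ID Git gives a blob holding `content`.

    Git hashes a header of the form b"blob <size>\\0" followed by the
    raw bytes, not the bytes alone.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Split an object ID into the directory/file layout described above."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


# Illustrative input; any bytes would do.
oid = git_blob_id(b"example file contents\n")
print(oid)
print(loose_object_path(oid))
```

The result should match what `git hash-object <file>` reports for a file with the same bytes. Note that Git may later move loose objects into packfiles, so the path only applies while the object is stored loose.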

The result is that even if you have multiple files with the same contents but different names, in different locations, or from different commits, only one copy is ever saved, and each commit's tree simply points to it. In the layout from the question, the files copied unchanged from run1.0/data into run2.0/data would not be stored a second time.
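The deduplication can be illustrated with a toy content-addressed store in Python. This is only a sketch of the idea, not Git's actual implementation: the `build_store` helper, the file paths, and the fake .wav bytes are all hypothetical.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA-1 over Git's blob header plus the raw content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def build_store(files: dict):
    """Return (objects, tree) for a mapping of path -> content bytes."""
    objects = {}  # object ID -> content, each distinct content stored once
    tree = {}     # path -> object ID, i.e. the "pointer" to the content
    for path, content in files.items():
        oid = blob_id(content)
        objects.setdefault(oid, content)  # no-op if already stored
        tree[path] = oid
    return objects, tree


# Hypothetical data: the same sample appears in both runs.
files = {
    "run1.0/data/sample.wav": b"RIFF fake audio bytes",
    "run2.0/data/sample.wav": b"RIFF fake audio bytes",
    "run2.0/data/extra.wav": b"RIFF other fake audio bytes",
}
objects, tree = build_store(files)
print(len(tree), "paths ->", len(objects), "stored objects")
# prints: 3 paths -> 2 stored objects
```

Both sample.wav paths map to the same object ID, so the content is held once. This only applies to byte-for-byte identical files; a file that differs by even one byte gets a different ID.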

124

answered Sep 21 '22 07:09

Dave Sexton
