
How does git store duplicate files?

Tags:

git

We have a Git repository that contains SVM AI input data and results. Every time we run a new model, we create a new root folder for that model so that we can organize our results over time:

    /run1.0
      /data
        ... 100 mb of data
      /classification.csv
      /results.csv
      ...
    /run2.0
      /data
        ... 200 mb of data (including run1.0/data)
      /classification.csv
      /results.csv
      ...

As we build new models we may pull in data (large .wav files) from a previous run. This means that run2.0/data may contain all the files from run1.0/data plus any additional data we have collected.

The repo will easily exceed a gigabyte if we keep this up.

Does Git have a way to recognize duplicate binary files and store them only once (e.g. like a symlink)? If not, we will rework how the data is stored.

JoshuaJ asked Apr 29 '15 15:04


People also ask

Does Git store copies of files?

Git logically stores each file under the SHA-1 hash of its content. This means that if you have two files with exactly the same content in a repository (or if you rename a file), only one copy is stored. It also means that when you modify a small part of a file and commit, the new version is stored as a separate object, although Git may later delta-compress objects when it packs them.

Does Git store diffs or whole files?

No, commit objects in git don't contain diffs - instead, each commit object contains a hash of the tree, which recursively and completely defines the content of the source tree at that commit.

How does Git store information?

Git stores every single version of each file it tracks as a blob. Git identifies blobs by the hash of their content and keeps them in .git/objects. Any change to the file content will generate a completely new blob object.

How does Git keep track of files?

Indexing. For every tracked file, Git records information such as its size, last status-change time (ctime) and last modification time (mtime) in a file known as the index. To determine whether a file has changed, Git compares its current stats with those cached in the index. If they match, then Git can skip reading the file again.


1 Answer

I am probably not going to explain this quite right, but my understanding is that every commit stores only a tree structure representing the file structure of your project, with pointers to the actual file contents, which are stored in an objects subfolder. Git uses a SHA-1 hash of the file contents (prefixed with a short header) to create the file name and subfolder, so for example if a file's contents produced the following hash:

0b064b56112cc80495ba59e2ef63ffc9e9ef0c77 

It would be stored as:

.git/objects/0b/064b56112cc80495ba59e2ef63ffc9e9ef0c77 

The first two characters are used as a directory name and the rest as the file name.
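As a rough illustration of that naming scheme, here is a minimal Python sketch. It assumes the loose-object layout described above and that Git hashes a short "blob <size>\0" header followed by the file contents. The function names git_blob_id and loose_object_path are made up for this example, and a real repository may also keep objects inside packfiles rather than as loose files.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git would give a blob with this content.

    Assumption: the hash covers the header "blob <size>\\0" plus the raw bytes.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Map an object ID to its loose-object path (hypothetical helper)."""
    # First two hex characters become the directory, the rest the file name.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


# The empty blob has a well-known ID.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# The hash from the answer maps to the path shown above.
print(loose_object_path("0b064b56112cc80495ba59e2ef63ffc9e9ef0c77"))
```

Note that the path depends only on the content, never on the file's name or location in the working tree.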

The result is that even if you have multiple files with the same contents but different names, in different locations, or from different commits, only one copy is ever saved, and each commit tree that uses it simply points to that one copy.
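To make the deduplication concrete for the run1.0/run2.0 layout from the question, here is a small in-memory sketch in Python. It is not how Git is implemented: the build_store function, the dictionaries, and the fake file contents are all hypothetical stand-ins, and the only assumption carried over from Git is that a blob's ID is the SHA-1 of a "blob <size>\0" header plus the content.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 of the assumed "blob <size>\\0" header plus the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def build_store(files: dict):
    """Toy object store: returns (objects, tree).

    objects maps blob ID -> content (each distinct content stored once);
    tree maps path -> blob ID (one pointer per path).
    """
    objects = {}
    tree = {}
    for path, content in files.items():
        object_id = git_blob_id(content)
        objects.setdefault(object_id, content)  # stored only the first time
        tree[path] = object_id
    return objects, tree


# Placeholder bytes standing in for the large .wav files.
files = {
    "run1.0/data/a.wav": b"fake audio A",
    "run2.0/data/a.wav": b"fake audio A",  # same content copied into run2.0
    "run2.0/data/b.wav": b"fake audio B",
}

objects, tree = build_store(files)
print(len(tree), "paths ->", len(objects), "stored blobs")  # 3 paths -> 2 stored blobs
```

Under these assumptions, copying run1.0/data into run2.0/data adds new tree entries but no new blobs, as long as the copied files are byte-for-byte identical.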

Dave Sexton answered Sep 21 '22 07:09