For the longest time I thought git commits keep diffs of changed files and not copies. Any information I could find states the contrary. I conducted a little experiment:
$ git init
$ subl wtf
(Here I create a file with 99,999 lines, each of which reads "foo bar baz N", where N is the line number.)
$ ls -la
total 1760
drwxrwxr-x 3 __user__ __user__ 4096 Aug 13 21:02 .
drwxr-xr-x 3 __user__ __user__ 4096 Aug 13 19:57 ..
drwxrwxr-x 7 __user__ __user__ 4096 Aug 13 21:02 .git
-rw-rw-rw- 1 __user__ __user__ 1788875 Aug 13 21:02 wtf
$ git add --all
$ git commit -m 'Initial commit'
[master (root-commit) 6ef5084] Initial commit
1 file changed, 99999 insertions(+)
create mode 100644 wtf
$ subl wtf
$ git diff
diff --git a/wtf b/wtf
index 7ba3acb..bf7a9ed 100644
--- a/wtf
+++ b/wtf
@@ -14156,7 +14156,7 @@ foo bar baz 14155
foo bar baz 14156
foo bar baz 14157
foo bar baz 14158
-foo bar baz 14159
+foo qux baz 14159
foo bar baz 14160
foo bar baz 14161
foo bar baz 14162
$ git add --all
$ git commit -m 'bar -> qux on #14159'
[master 1b5ab4b] bar -> qux on #14159
1 file changed, 1 insertion(+), 1 deletion(-)
$ subl wtf
$ git diff
diff --git a/wtf b/wtf
index bf7a9ed..1aeeaa3 100644
--- a/wtf
+++ b/wtf
@@ -14156,7 +14156,7 @@ foo bar baz 14155
foo bar baz 14156
foo bar baz 14157
foo bar baz 14158
-foo qux baz 14159
+xyz abc baz 14159
foo bar baz 14160
foo bar baz 14161
foo bar baz 14162
$ git add --all
$ git commit -m 'foo qux -> xyz abc on #14159'
[master 85ccf97] foo qux -> xyz abc on #14159
1 file changed, 1 insertion(+), 1 deletion(-)
$ ls -la
total 1760
drwxrwxr-x 3 __user__ __user__ 4096 Aug 13 21:02 .
drwxr-xr-x 3 __user__ __user__ 4096 Aug 13 19:57 ..
drwxrwxr-x 9 __user__ __user__ 4096 Aug 13 21:05 .git
-rw-rw-rw- 1 __user__ __user__ 1788875 Aug 13 21:04 wtf
Even commits on different branches with conflicts didn't change the situation.
If git truly keeps copies of all changed files with every commit, how come there was no significant change in space used?
Git has an object database. One type of object is the "blob", which is identified by the SHA-1 of its content (plus a short header). This means that if a file with the same content appears anywhere in the repository (any branch, point in history, directory, etc.), it is stored in the database only once.
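To make the "identified by sha1 of its content" part concrete, here is a minimal Python sketch of how a blob id can be computed, assuming a repository that uses the default SHA-1 object format. The function name git_blob_id is made up for this example; for the same bytes it should agree with what git hash-object prints.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id git gives a blob with this content (SHA-1 repositories)."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The same bytes always map to the same id, no matter which branch,
# commit or directory they appear in, so they are stored only once.
a = git_blob_id(b"hello world\n")
b = git_blob_id(b"hello world\n")
c = git_blob_id(b"hello qux\n")
print(a)
print(a == b, a == c)
```

Since the id depends only on the bytes, renaming or copying a file does not create a new blob.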
There are two parts in the database. The first is the objects/??/* files, which are individual ("loose") objects. I.e. if you have two versions of a large file which differ by only a single line, they will be stored twice, in two different files, each compressed on its own with zlib.
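A rough Python sketch of what storing two loose objects amounts to, using the file from the question. It assumes the file has no trailing newline (that is the only way 99,999 such lines add up to the 1788875 bytes shown by ls), and loose_object is a name made up for this example, not a git API.

```python
import hashlib
import zlib
from typing import Tuple


def loose_object(content: bytes) -> Tuple[str, bytes]:
    """Return (path relative to .git, compressed bytes) for a loose blob."""
    data = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    oid = hashlib.sha1(data).hexdigest()
    # The first two hex digits name the directory, the rest name the file.
    path = "objects/" + oid[:2] + "/" + oid[2:]
    return path, zlib.compress(data)


# Rebuild the file from the question: lines "foo bar baz 1" .. "foo bar baz 99999".
lines = ["foo bar baz %d" % n for n in range(1, 100000)]
v1 = "\n".join(lines).encode("ascii")

lines[14158] = "foo qux baz 14159"  # line 14159 (list index is zero-based)
v2 = "\n".join(lines).encode("ascii")

# Each version is compressed and stored in full, under its own id.
for version in (v1, v2):
    path, packed = loose_object(version)
    print(path, len(version), "->", len(packed), "bytes")
```

Both versions end up as separate, fully compressed files. They live under .git/objects, so their size would not show up in an ls -la of the working directory, which lists only the directory entry for .git itself.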
Then, if git thinks the objects directory has grown too much, it runs garbage collection. One of the steps of this process is repacking. It creates large pack files in the objects/pack/ folder, which use a clever delta-compression algorithm. This works not on the history of one particular file but across the whole object database, so even completely unrelated files that happen to look similar can be packed as deltas of one another.
So, the deltas could be recomputed differently after each git gc command, taking into account the latest changes in the history.
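The idea behind delta compression can be sketched in a few lines of Python. This is a toy, not git's pack format: real pack deltas are also built from copy and insert instructions, but they are binary-encoded and can reuse matching regions from anywhere in the base object, while this sketch only reuses a common prefix and suffix. The names make_delta and apply_delta are made up for the example.

```python
def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy/insert instructions against base (toy version)."""
    limit = min(len(base), len(target))
    # Length of the longest common prefix.
    p = 0
    while p < limit and base[p] == target[p]:
        p += 1
    # Length of the longest common suffix that does not overlap the prefix.
    s = 0
    while s < limit - p and base[-1 - s] == target[-1 - s]:
        s += 1
    return [
        ("copy", 0, p),                          # reuse base[0:p]
        ("insert", target[p:len(target) - s]),   # literal new bytes
        ("copy", len(base) - s, s),              # reuse the last s bytes of base
    ]


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the target from the base object and the instructions."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


lines = ["foo bar baz %d" % n for n in range(1, 100000)]
v1 = "\n".join(lines).encode("ascii")
lines[14158] = "foo qux baz 14159"
v2 = "\n".join(lines).encode("ascii")

delta = make_delta(v1, v2)
print(delta[1])                       # the only literal data in the delta
print(apply_delta(v1, delta) == v2)   # the full file can still be rebuilt
```

The second version of a 1.7 MB file is then described by two copy instructions and three literal bytes, which is why repacking shrinks a repository with many near-identical objects so much.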
Also, object packs vs. loose objects are only physical storage details, which are completely transparent when you use git every day. E.g. log, cherry-pick, merge, etc. all operate on the full snapshot of a commit. So, when you ask for a diff, git just compares two versions of a directory or file on the fly and generates a patch/diff for you.
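As a small illustration of "diff on the fly", the following Python sketch keeps two full snapshots of a few lines from the question's file and derives the patch only when asked. It uses the standard difflib module as a stand-in; git has its own diff implementation, so the hunk header will not match git's output exactly.

```python
import difflib

# Two complete snapshots of (part of) the file; no diff is stored anywhere.
old = ["foo bar baz %d" % n for n in range(14150, 14170)]
new = list(old)
new[new.index("foo bar baz 14159")] = "foo qux baz 14159"

# The patch is computed from the two snapshots at the moment it is requested.
patch = list(difflib.unified_diff(old, new, "a/wtf", "b/wtf", lineterm=""))
print("\n".join(patch))
```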
This approach is fairly unusual in comparison to other VCSs. E.g. Mercurial stores immutable delta-logs for each file separately, and Subversion stores deltas for the whole repository. This affects how those systems work: physical storage is not abstracted away, which causes some significant limitations, while git allows very flexible workflows and algorithms while keeping the size of the repository very small.
Every time a file changes, Git stores a new copy of that file in its database. A commit stores a reference to the most recent version of a file tracked by that commit. This means that when a commit is created, it uses the reference stored by its parent for unchanged files, and the reference to the newly added version for changed files.
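A minimal Python sketch of that reference-sharing, under simplifying assumptions: a snapshot is modelled as a flat dict from path to blob id, whereas real git uses tree objects for directories, and store, put and commit are names made up for the example.

```python
import hashlib

store = {}  # blob id -> content; stands in for git's object database


def put(content: bytes) -> str:
    """Store content under its blob id and return the id."""
    oid = hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()
    store[oid] = content  # storing identical content again changes nothing
    return oid


def commit(parent_tree: dict, changed: dict) -> dict:
    """Build a new snapshot: keep the parent's ids, replace only changed paths."""
    tree = dict(parent_tree)
    for path, content in changed.items():
        tree[path] = put(content)
    return tree


c1 = commit({}, {"wtf": b"foo bar baz\n", "README": b"hello\n"})
c2 = commit(c1, {"wtf": b"foo qux baz\n"})

print(c1["README"] == c2["README"])  # unchanged file: both commits share one blob
print(c1["wtf"] == c2["wtf"])        # changed file: the new commit points elsewhere
print(len(store))                    # distinct blobs stored across both commits
```

Two commits of two files produce three blobs, not four, because the unchanged file is referenced rather than copied.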
Periodically (or on demand with, say, git gc), the database is compacted by creating pack files which contain the most recent version for each file in a given set, along with "reverse diffs" that can be used to reconstruct older versions as needed.
At least two mechanisms reduce the total storage needed in Git's object database. First, each object is compressed individually. Second, objects are lumped together into object "packs" that relate the objects with deltas, saving even more space for similar objects. There's a chapter on packfiles in Pro Git which is quite illuminating.