How does git track changes to files

Tags: git, internals

For the longest time I thought git commits keep diffs of changed files, not full copies. Everything I could find says the opposite, so I conducted a little experiment:

$ git init
$ subl wtf

Here I create a file with 99 999 lines, each of which reads "foo bar baz" followed by its line number.

$ ls -la
total 1760
drwxrwxr-x 3 __user__ __user__    4096 Aug 13 21:02 .
drwxr-xr-x 3 __user__ __user__    4096 Aug 13 19:57 ..
drwxrwxr-x 7 __user__ __user__    4096 Aug 13 21:02 .git
-rw-rw-rw- 1 __user__ __user__ 1788875 Aug 13 21:02 wtf
$ git add --all
$ git commit -m 'Initial commit'
[master (root-commit) 6ef5084] Initial commit
 1 file changed, 99999 insertions(+)
 create mode 100644 wtf
$ subl wtf
$ git diff
diff --git a/wtf b/wtf
index 7ba3acb..bf7a9ed 100644
--- a/wtf
+++ b/wtf
@@ -14156,7 +14156,7 @@ foo bar baz 14155
 foo bar baz 14156
 foo bar baz 14157
 foo bar baz 14158
-foo bar baz 14159
+foo qux baz 14159
 foo bar baz 14160
 foo bar baz 14161
 foo bar baz 14162
$ git add --all
$ git commit -m 'bar -> qux on #14159'
[master 1b5ab4b] bar -> qux on #14159
 1 file changed, 1 insertion(+), 1 deletion(-)
$ subl wtf
$ git diff
diff --git a/wtf b/wtf
index bf7a9ed..1aeeaa3 100644
--- a/wtf
+++ b/wtf
@@ -14156,7 +14156,7 @@ foo bar baz 14155
 foo bar baz 14156
 foo bar baz 14157
 foo bar baz 14158
-foo qux baz 14159
+xyz abc baz 14159
 foo bar baz 14160
 foo bar baz 14161
 foo bar baz 14162
$ git add --all
$ git commit -m 'foo qux -> xyz abc on #14159'
[master 85ccf97] foo qux -> xyz abc on #14159
 1 file changed, 1 insertion(+), 1 deletion(-)
$ ls -la
total 1760
drwxrwxr-x 3 __user__ __user__    4096 Aug 13 21:02 .
drwxr-xr-x 3 __user__ __user__    4096 Aug 13 19:57 ..
drwxrwxr-x 9 __user__ __user__    4096 Aug 13 21:05 .git
-rw-rw-rw- 1 __user__ __user__ 1788875 Aug 13 21:04 wtf

Even commits on different branches with conflicts didn't change the situation.

If git truly keeps copies of all changed files with every commit, how come there was no significant change in space used?

ndnenkov asked Aug 13 '15 18:08


3 Answers

Git has an object database. One type of object, the "blob", is identified by the SHA-1 hash of its content (prefixed with a short type-and-size header). This means that if a file with the same content appears anywhere in the repository (any branch, point in history, directory, etc.), it is stored in the database only once.
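To see the content addressing in action, here is a minimal sketch. It assumes git and GNU coreutils (sha1sum) are installed and works in a throwaway directory; the file names a.txt and b.txt are made up for the illustration.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Two files with different names but identical content.
printf 'foo bar baz 1\n' > a.txt
printf 'foo bar baz 1\n' > b.txt

# hash-object prints the blob ID; -w also writes the blob into the object database.
id_a=$(git hash-object -w a.txt)
id_b=$(git hash-object -w b.txt)
echo "$id_a"
echo "$id_b"

# The same ID can be reproduced by hand: SHA-1 over "blob <size>\0" followed by the content.
size=$(wc -c < a.txt | tr -d ' ')
manual=$( { printf 'blob %s\0' "$size"; cat a.txt; } | sha1sum | cut -d' ' -f1)
echo "$manual"

# Number of object files written for the two files.
objects=$(find .git/objects -type f | wc -l | tr -d ' ')
echo "$objects"
```

All three IDs should match, and only a single object file should exist, because the file name is not part of the blob.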

The database has two parts. One is the objects/??/* files, which are individual ("loose") objects. If you have two versions of a large file that differ by only a single line, the file is stored twice, as two separate objects, each compressed on its own with zlib.
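A scaled-down rerun of the experiment from the question makes the loose objects visible. This is only a sketch: it uses 10000 lines instead of 99 999, edits line 5000, and defines two small helpers (commit and loose_path) that are invented for this example. Note that the 4096 shown for .git by ls -la in the question is the size of the directory entry, not of its contents, so du is used here instead.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Helper for this sketch: commit with a throwaway identity so no global config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

# A smaller version of the file from the question.
seq 1 10000 | sed 's/^/foo bar baz /' > wtf
git add wtf
commit 'Initial commit'
old=$(git rev-parse HEAD:wtf)

# Change a single line and commit again.
sed '5000s/bar/qux/' wtf > wtf.tmp && mv wtf.tmp wtf
git add wtf
commit 'bar -> qux on #5000'
new=$(git rev-parse HEAD:wtf)

# Each version is a separate blob holding the full content (same uncompressed size).
git cat-file -s "$old"
git cat-file -s "$new"

# Loose objects live at objects/<first 2 hex chars>/<remaining chars>, zlib-compressed.
loose_path() { echo ".git/objects/$(echo "$1" | cut -c1-2)/$(echo "$1" | cut -c3-)"; }
ls -l "$(loose_path "$old")" "$(loose_path "$new")"

# Measure the contents of .git rather than its directory entry.
du -sh .git
git count-objects -v
```

The two blob files on disk should be much smaller than the working file because the content is very repetitive, but there are two of them, one per version.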

Then, if git decides the objects directory has grown too much, it runs garbage collection. One step of that process is repacking: it creates large pack files in the objects/pack/ folder (the other part of the database). Pack files use a clever delta-compression algorithm that works not on the history of one particular file but across the whole object database, so even completely unrelated files that happen to look similar can be packed as deltas of one another.

So the deltas can be recomputed differently on each git gc run, taking the latest changes in the history into account.
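The effect of repacking can be observed with git count-objects before and after an explicit git gc. The sketch below builds a small repository with four versions of one file; the commit helper, the file contents, and the line being edited are assumptions made for the demonstration.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Helper for this sketch: commit with a throwaway identity.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

seq 1 10000 | sed 's/^/foo bar baz /' > wtf
git add wtf
commit 'Initial commit'

# Three more versions, each differing from the previous one in a single line.
for n in 1 2 3; do
  sed "5000s/.*/edit number $n/" wtf > wtf.tmp && mv wtf.tmp wtf
  git add wtf
  commit "edit $n"
done

# Before: everything is loose ("count" is loose objects, "in-pack" is packed objects).
git count-objects -v
before=$(du -sk .git/objects | cut -f1)

# gc repacks the loose objects into objects/pack/ using delta compression.
git gc -q

git count-objects -v
after=$(du -sk .git/objects | cut -f1)
echo "objects dir: ${before} KiB before, ${after} KiB after"
ls .git/objects/pack
```

After the gc the loose count should drop to zero and the objects directory should shrink, since the four near-identical blobs no longer each carry a full compressed copy.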

Also, pack files versus loose objects are purely physical storage details, completely transparent in everyday use of git. Commands such as log, cherry-pick and merge operate on the full snapshot of a commit. So when you run a diff, git simply compares the two versions of the directories and files on the fly and generates the patch for you.
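One way to convince yourself that the storage format is invisible to everyday commands is to compute the same diff before and after a repack. A small sketch, with a made-up file notes.txt and a throwaway commit helper:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Helper for this sketch: commit with a throwaway identity.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

printf 'one\ntwo\nthree\n' > notes.txt
git add notes.txt
commit 'first'

printf 'one\n2\nthree\n' > notes.txt
git add notes.txt
commit 'second'

# Each commit's tree lists a complete blob for the file, not a patch.
git ls-tree HEAD~1
git ls-tree HEAD

# The diff is computed on demand by comparing the two snapshots.
loose=$(git diff HEAD~1 HEAD)

# Repacking changes only the physical storage; the computed diff stays the same.
git gc -q
packed=$(git diff HEAD~1 HEAD)
echo "$packed"
```

The diff text is identical in both cases because it is derived from the snapshots, not read from stored deltas.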

This approach is fairly unusual compared to other VCSs. For example, Mercurial stores immutable delta logs for each file separately, and Subversion stores deltas for the whole repository. That affects how those systems work: the physical storage is not abstracted away, which causes some significant limitations, whereas git allows very flexible workflows and algorithms while keeping the repository very small.

kan answered Nov 04 '22 10:11


Every time a changed file is added and committed, Git stores a new copy of that file in its database. A commit refers, through its tree, to the version of each file it tracks. This means that a new commit reuses the reference stored by its parent for unchanged files, and points to the newly added version for changed files.
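The reference sharing can be checked directly by comparing blob IDs across two commits. In this sketch the file names unchanged.txt and changed.txt and the commit helper are invented for the example.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Helper for this sketch: commit with a throwaway identity.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

printf 'stable\n' > unchanged.txt
printf 'v1\n' > changed.txt
git add unchanged.txt changed.txt
commit 'first'

# Only one of the two files is modified in the second commit.
printf 'v2\n' > changed.txt
git add changed.txt
commit 'second'

# A commit object points at a tree; the tree maps file names to blob IDs.
git cat-file -p HEAD
git cat-file -p 'HEAD^{tree}'

# Blob IDs of each file as recorded by the parent and by the new commit.
same_old=$(git rev-parse HEAD~1:unchanged.txt)
same_new=$(git rev-parse HEAD:unchanged.txt)
diff_old=$(git rev-parse HEAD~1:changed.txt)
diff_new=$(git rev-parse HEAD:changed.txt)
echo "unchanged.txt: $same_old -> $same_new"
echo "changed.txt:   $diff_old -> $diff_new"
```

The unchanged file should show the same blob ID in both commits, while the changed file gets a new one.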

Periodically (or on demand with, say, git gc), the database is compacted by creating pack files, which typically contain the most recent version of each file in a given set in full, along with "reverse diffs" (deltas) that can be used to reconstruct older versions as needed.
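To inspect which objects ended up as deltas, git verify-pack -v lists every object in a pack; entries stored as deltas carry two extra columns (delta depth and base object ID). The sketch below is an illustration under assumptions: the repository contents and the commit helper are made up, and which version Git picks as the base is a heuristic, so the newer-in-full layout is the usual outcome rather than a guarantee.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Helper for this sketch: commit with a throwaway identity.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

seq 1 10000 | sed 's/^/foo bar baz /' > wtf
git add wtf
commit 'Initial commit'
older=$(git rev-parse HEAD:wtf)

sed '5000s/bar/qux/' wtf > wtf.tmp && mv wtf.tmp wtf
git add wtf
commit 'bar -> qux on #5000'
newer=$(git rev-parse HEAD:wtf)

git gc -q
idx=$(ls .git/objects/pack/pack-*.idx)

# Full listing: ID, type, size, size in pack, offset, and for deltas also depth and base ID.
git verify-pack -v "$idx"

# Keep only the delta entries (the lines with seven columns).
deltas=$(git verify-pack -v "$idx" | awk 'NF == 7')
echo "older blob: $older"
echo "newer blob: $newer"
echo "stored as delta:"
echo "$deltas"
```

Comparing the delta lines against the two blob IDs shows which version is kept whole and which is reconstructed from it.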

chepner answered Nov 04 '22 09:11


At least two mechanisms reduce the total storage needed in Git's object database. First, each object is compressed individually. Second, objects are grouped into "packs" that store similar objects as deltas against one another, saving even more space. The chapter on packfiles in Pro Git is quite illuminating.

Wolf answered Nov 04 '22 08:11