
Git design decision on storing content rather than differences

Tags:

git

Could anyone give me some idea to why git developers made a design decision to store contents of files (blobs), so when the content changes a new blob needs to be created?

I believe subversion stores revisions rather than contents, so when the content changes, it simply keeps track of the differences between the two. Couldn't git have done it like this as well? What's the benefit of storing contents rather than revisions?

chibicode asked Sep 21 '09 04:09


3 Answers

I couldn't find the answer with a quick google, but I believe it boils down to a simple "it doesn't matter 'cause disk space is cheap".

Storing revisions within a source code management tool is tricky. If you only ever store the difference between the previous revision and the current, you end up with two problems:

  1. Returning the latest revision (the common case) requires the most work, as the code needs to assemble that revision by combining every revision together.
  2. Any error (say, a disk fault) to one revision corrupts access to every later revision.

I believe that most modern VCS actually store the latest revision (for performance reasons) and differences, if used, are used to go back in time, not forwards.
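To make the trade-off above concrete, here is a minimal sketch that counts how many deltas must be applied to reach a revision under each scheme. The prefix/suffix delta and all names (`make_delta`, `apply_delta`, `latest_forward`, `oldest_reverse`) are illustrative assumptions, not how any particular VCS encodes its differences.

```python
def make_delta(old, new):
    """Toy delta: shared prefix length, shared suffix length, changed middle."""
    p = 0
    while p < min(len(old), len(new)) and old[p] == new[p]:
        p += 1
    s = 0
    while s < min(len(old), len(new)) - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return p, s, new[p:len(new) - s]


def apply_delta(old, delta):
    """Rebuild the target revision from `old` and a delta made by make_delta."""
    p, s, middle = delta
    return old[:p] + middle + old[len(old) - s:]


# Four hypothetical revisions of one file, as lists of lines.
revisions = [
    ["a"],
    ["a", "b"],
    ["a", "b", "c"],
    ["a", "B", "c", "d"],
]
pairs = list(zip(revisions, revisions[1:]))

# Forward scheme: the oldest revision in full, plus deltas going forward.
forward_base = revisions[0]
forward_deltas = [make_delta(old, new) for old, new in pairs]

# Reverse scheme: the latest revision in full, plus deltas going backward.
reverse_base = revisions[-1]
reverse_deltas = [make_delta(new, old) for old, new in pairs]


def latest_forward():
    """Reaching the latest revision means replaying every forward delta."""
    text, steps = forward_base, 0
    for delta in forward_deltas:
        text, steps = apply_delta(text, delta), steps + 1
    return text, steps


def oldest_reverse():
    """Reaching the oldest revision means replaying every reverse delta."""
    text, steps = reverse_base, 0
    for delta in reversed(reverse_deltas):
        text, steps = apply_delta(text, delta), steps + 1
    return text, steps


print(latest_forward())   # latest revision costs len(revisions) - 1 steps
print(oldest_reverse())   # only old revisions pay that cost here
```

In the reverse scheme the latest revision is simply `reverse_base`, with no deltas applied, which is why the common case stays cheap.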

Bevan answered Nov 18 '22 03:11


An article that addresses this (and related) issues is Repository Formats Matter. This was one of the articles that influenced my decision to move to Git a couple of years ago. Here is an excerpt:

Given this argument, it should be clear that I think git’s repository structure is better than others, at least for X.org’s usage model. It seems to hold several interesting properties:

  1. Files containing object data are never modified. Once written, every file is read-only from that point forward.

  2. Compression is done off-line and can be delayed until after the primary objects are saved to backup media. This method provides better compression than any incremental approach, allowing data to be re-ordered on disk to match usage patterns.

  3. Object data is inherently self-checking; you cannot modify an object in the repository and escape detection the first time the object is referenced.
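The "self-checking" property in point 3 follows from naming each object by a hash of its own contents. A minimal sketch, assuming a repository that uses Git's default SHA-1 object format: a blob's name is the SHA-1 of a short header plus the file contents. The helper names `git_blob_id` and `verify` are made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object name Git gives a blob: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def verify(object_id: str, content: bytes) -> bool:
    """Stored content is intact only if it still hashes to its own name."""
    return git_blob_id(content) == object_id


original = b"hello world\n"
oid = git_blob_id(original)

print(oid)
print(verify(oid, original))            # unchanged content matches its name
print(verify(oid, b"hello w0rld\n"))    # any modification is detected
```

The value printed for `oid` should match what `git hash-object` reports for a file with the same contents.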

Greg Hewgill answered Nov 18 '22 03:11


Let me clear up your misconceptions:

Could anyone give me some idea to why git developers made a design decision to store contents of files (blobs), so when the content changes a new blob needs to be created?

A fairly good explanation of the (initial) Git design can be found in Tom Preston-Werner's essay The Git Parable (in addition to the article linked in Greg Hewgill's answer).

The idea behind it is that usually (in a large enough project) a new revision changes only a few of the project's many files, so storing only the distinct versions of each file's contents saves space: an unchanged file is not stored again. This is the same idea that Subversion uses in its 'cheap copy' technique (it uses hardlinking, IIRC).
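A rough sketch of that idea, under simplifying assumptions: the `ObjectStore` class is hypothetical, and it keys blobs by a plain SHA-1 of the contents rather than by Git's real object format. Two snapshots of a three-file project that differ in one file end up sharing the other two blobs.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: identical contents are stored once."""

    def __init__(self):
        self.blobs = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self.blobs.setdefault(key, content)   # no-op if already stored
        return key

    def snapshot(self, files: dict) -> dict:
        """Record a revision as a mapping from path to blob key."""
        return {path: self.put(data) for path, data in files.items()}


store = ObjectStore()

rev1 = store.snapshot({
    "README": b"docs\n",
    "main.c": b"int main(){}\n",
    "util.c": b"/* util */\n",
})
rev2 = store.snapshot({
    "README": b"docs\n",                      # unchanged
    "main.c": b"int main(){return 0;}\n",     # changed
    "util.c": b"/* util */\n",                # unchanged
})

# Six file entries across two revisions, but only four stored blobs.
print(len(store.blobs))
print(rev1["README"] == rev2["README"])
```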

Also, the contents of each file are zlib (deflate) compressed (or, to be more exact, each object in the Git repository database is compressed, including commit objects).
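As an illustration of that compression, the following sketch encodes and decodes data the way a loose object is laid out: a `<type> <size>\0` header followed by the body, all deflated with zlib. The function names are invented for this example, and it works on in-memory bytes instead of reading files under .git.

```python
import zlib


def encode_loose_object(kind: bytes, body: bytes) -> bytes:
    """Deflate '<type> <size>\\0<body>', the layout of a loose object."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return zlib.compress(header + body)


def decode_loose_object(raw: bytes):
    """Inflate a loose object and split it back into (type, body)."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind, body


body = b"a line of source code\n" * 1000
raw = encode_loose_object(b"blob", body)

print(len(body), len(raw))                    # repetitive text shrinks a lot
print(decode_loose_object(raw) == (b"blob", body))
```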

I believe Subversion stores revisions rather than contents, so when the content changes, it simply keeps track of the differences between the two. Couldn't git have done it like this as well? What's the benefit of storing contents rather than revisions?

I don't understand what you wanted to say here.

If your point was that storing differences saves space, then I'd like to tell you that in addition to the 'loose' format (where each blob, i.e. each distinct version of a file's contents, is stored in a separate file inside .git), Git also has a 'packed' format, where many objects are stored in deltified form, using binary deltas from the LibXDiff library.
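To give a feel for what "deltified" means, here is a sketch of a binary delta made of two kinds of instruction: copy a range of bytes from a base object, or insert literal bytes. It uses Python's difflib to find the matching ranges, which is an assumption made for brevity; it is not the LibXDiff algorithm, and the tuples below are not Git's actual pack encoding.

```python
import difflib


def make_delta(base: bytes, target: bytes):
    """Express `target` as copy-from-base and insert-literal instructions."""
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2 - i1))       # offset, length in base
        elif tag in ("replace", "insert"):
            delta.append(("insert", target[j1:j2]))   # new literal bytes
        # "delete" needs no instruction: those base bytes are just not copied
    return delta


def apply_delta(base: bytes, delta) -> bytes:
    """Rebuild the target object from the base object and a delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(delta) -> int:
    """Bytes the delta must carry itself rather than borrow from the base."""
    return sum(len(op[1]) for op in delta if op[0] == "insert")


base = b"".join(b"line %d\n" % i for i in range(50))
target = base.replace(b"line 25\n", b"line twenty-five\n")

delta = make_delta(base, target)
print(apply_delta(base, delta) == target)
print(len(target), literal_bytes(delta))   # full size vs. literal payload
```

Because most of the target is expressed as copies from the base, only the changed bytes need to be stored alongside the instructions.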

This format was created for network transfer (disk space might be cheap, but bandwidth isn't), and was then adapted for use as an on-disk format as well. It is very efficient, among the most efficient formats of any version control system, which makes Git repositories the smallest, or among the smallest, across version control systems. Depending on circumstances, a full clone of a Git repository (which contains the full history) might be smaller than the equivalent Subversion checkout (which contains an extra pristine copy of each file so that svn diff and svn status work at reasonable speed without network transfer).

This design ('loose' and 'packed' formats) has the advantage of very efficient packing, but it had the disadvantage that you had to repack manually using "git gc" (not for disk space, but for performance, i.e. disk I/O); nowadays most Git commands repack the repository (safely) when needed.

Jakub Narębski answered Nov 18 '22 01:11