
Can someone explain the distinction between content tracking used in Git and file tracking used in other SCMs?

Tags:

git

svn

I've been using Git for a while now and love the features and flexibility in workflow it allows. The ability to commit early and often is a huge deal for me and really fits into my way of working.

One feature of Git I've heard mentioned many times but have yet to get my head around is the fact that it tracks content rather than file history, which supposedly makes dealing with renaming and moving files much better.

Can someone explain why this is? I haven't noticed anything special in this regard compared to SVN. What am I missing?

BenBtg asked Apr 09 '11 09:04



2 Answers

Git stores three pieces of data separately:

  • content is stored in blob objects
  • history is stored in commit objects
  • structure is stored in tree objects

A consequence of this is that if you have the same data in several files, git only has to store it once, because the structure (which contains directories and files) only has to point at one content object.

Similarly, if a file does not change from version to version, git only has to store that file once. Multiple history objects point to the same content.
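To make the deduplication concrete, here is a minimal Python sketch. The blob ID calculation (SHA-1 over a "blob <size>\0" header plus the content) matches what git does in repositories using the default SHA-1 object format. The `store` dictionary and the `snapshot` helper are hypothetical names made up for illustration, and the flat name-to-ID mapping is a simplification: real tree objects are nested, one per directory.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the ID git gives a blob with this content (SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical in-memory object store, for illustration only.
store = {}


def snapshot(files: dict) -> dict:
    """Store each content once and return a simplified name -> blob ID 'tree'."""
    tree = {}
    for name, content in files.items():
        oid = blob_id(content)
        store[oid] = content  # identical content always lands on the same key
        tree[name] = oid
    return tree


# Two files with the same content, then a second revision changing one of them.
rev_a = snapshot({"README": b"hello\n", "docs/intro.txt": b"hello\n"})
rev_b = snapshot({"README": b"hello\n", "docs/intro.txt": b"hello, world\n"})

print(len(store))                                  # distinct contents stored
print(rev_a["README"] == rev_a["docs/intro.txt"])  # same content, same blob
print(rev_a["README"] == rev_b["README"])          # unchanged file, same blob
```

Four file entries across two revisions end up pointing at only two stored contents, because the ID depends on the content alone and not on the file name or the revision.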

One of the user-visible benefits is that git blame is very good at seeing code move across files, especially if you tell it to look harder with git blame -C. It's also part of why git is so compact and fast: the structure is very simple, very cheap to walk, and doesn't repeat itself.

One of the downsides is that git doesn't record file copies and renames. It guesses them from how similar the content is, and sometimes it guesses wrong.

This blog entry provides a decently well digested but still detailed discussion of what content tracking buys git. If you want to know more, you can watch Linus' Google Tech Talk on Git or read the transcript.

Schwern answered Sep 30 '22 15:09



The only information that Git stores from one revision to the next is the state (the names and contents) of the files at each revision. In revision A, this file had this content, and in revision B, this file had that other content. Git doesn't care how the files got from point A to point B, whether it was an edit, or a rename, or a conflict resolution, or an octopus merge.

This approach has the benefit of a conceptually simple repository format. This is important because your repository is your history, and history should be preserved in the simplest format possible.

One implication of this is that whenever Git needs to figure out what happened between revision A and B (for example), it needs to work out the details at the time you ask for it. Even for a simple diff, while some tools might be able to simply show the internally stored diff, Git compares the files in revision A and B and regenerates the diff when requested. For renames, Git notices that a new file just appeared, and looks for similar files in a previous revision to guess whether the file was renamed or not.
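The rename guessing described above can be sketched roughly as follows. This is not Git's actual algorithm: `detect_renames` is a hypothetical helper, and `difflib.SequenceMatcher` stands in for Git's own similarity index. The 0.5 default mirrors Git's default rename threshold of 50%, but the scores themselves are computed differently.

```python
from difflib import SequenceMatcher


def detect_renames(old: dict, new: dict, threshold: float = 0.5) -> list:
    """Guess renames between two snapshots (name -> text content).

    Nothing about renames is stored in the snapshots themselves;
    everything here is worked out at the time of asking.
    """
    deleted = {n: c for n, c in old.items() if n not in new}
    added = {n: c for n, c in new.items() if n not in old}
    renames = []
    for new_name, new_content in added.items():
        best_name, best_score = None, 0.0
        for old_name, old_content in deleted.items():
            score = SequenceMatcher(None, old_content, new_content).ratio()
            if score > best_score:
                best_name, best_score = old_name, score
        if best_name is not None and best_score >= threshold:
            renames.append((best_name, new_name, best_score))
            del deleted[best_name]  # each vanished file is matched at most once
    return renames


rev_a = {
    "util.py": "def add(a, b):\n    return a + b\n",
    "main.py": "print('hi')\n",
}
rev_b = {
    "helpers.py": "def add(a, b):\n    return a + b\n",
    "main.py": "print('hi')\n",
    "notes.txt": "todo\n",
}

for old_name, new_name, score in detect_renames(rev_a, rev_b):
    print(f"{old_name} -> {new_name} (similarity {score:.0%})")
```

Because the pairing is a heuristic over content, a file that was renamed and heavily edited in the same revision can fall below the threshold and be reported as an unrelated delete plus add, and two similar files can be paired with the wrong partner.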

As the Git tools improve over time, more of how the history was formed can be reported, without it having to have been recorded at the time. For example, it is often claimed that Git can "track individual bits of code moving from one file to another". This is entirely due to the cleverness of the programs doing the history reporting, and not due to anything stored in the repository itself.

Greg Hewgill answered Sep 30 '22 15:09
