
Git's blob data and diff information

Tags: git, diff

As far as I know, Git names each blob by the SHA-1 hash of its content, so that identical content is never stored twice in the repository.

For example, if file A contains "abc" and its SHA-1 hash is "12345", then as long as the content doesn't change, every commit and branch can point to the same SHA-1.

But what happens if file A is modified to contain "def", giving it the SHA-1 hash "23456"? Does Git store both the original file A and the modified file A (the whole file each time, not just the difference)?

  • If so, why is that? Wouldn't it be better to store only the diff?
  • If not, how does diff track the changes in a file?
  • How do other version control systems (CVS, SVN, Perforce, ...) handle this?

ADDED

The following passage from the 'Git Community Book' answers most of my questions.

It is important to note that this is very different from most SCM systems that you may be familiar with. Subversion, CVS, Perforce, Mercurial and the like all use Delta Storage systems - they store the differences between one commit and the next. Git does not do this - it stores a snapshot of what all the files in your project look like in this tree structure each time you commit. This is a very important concept to understand when using Git.

prosseek asked Sep 18 '10 21:09





1 Answer

Git stores files by content rather than as diffs, so in your example both versions of A ("abc" and "def") would be stored in the object database, each under its own SHA-1.
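
To make the content-addressing idea concrete, here is a minimal Python sketch. It assumes a repository using SHA-1 object IDs, where a blob's ID is the SHA-1 of the header `blob <size>\0` followed by the file content. The `store` dictionary and the `put` helper are hypothetical stand-ins for the object database, not Git's actual implementation (real Git also compresses objects before writing them to disk).

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the ID Git would assign to a blob with this content (SHA-1 repos)."""
    # Git hashes a small header plus the raw content, not the content alone.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical toy object database: keys are derived purely from content.
store = {}

def put(content: bytes) -> str:
    """Store content under its blob ID and return that ID."""
    oid = git_blob_id(content)
    store[oid] = content  # storing identical content again is a no-op
    return oid

first = put(b"abc")      # file A, original content
second = put(b"def")     # file A, modified content
unchanged = put(b"abc")  # same content again, e.g. in a later commit

print(first == unchanged)  # identical content maps to the same ID
print(first != second)     # different content gets a different ID
print(len(store))          # number of distinct blobs actually stored
```

Both versions of A end up as separate full entries, while the repeated "abc" content reuses the existing one, which is the behaviour described above.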

  • Storing whole objects makes it trivial to tell whether two versions of a file are identical: just compare their SHAs. Have a look at the git-book for details on how the objects are stored. It also means any version can be read directly. If files were tracked as diffs, you would need the entire history of a file to reconstruct it, which is easy in a centralised system but not in a distributed one, where a file can have many different, divergent changes.

  • Git computes the diff on demand, directly from the two stored objects.
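
As a rough illustration of the last point, the sketch below derives a diff from two complete snapshots without any stored delta. It uses Python's standard `difflib` module rather than Git's own diff machinery, and the function name `diff_snapshots` and the path `A` are assumptions made for the example.

```python
import difflib

def diff_snapshots(old_text: str, new_text: str, path: str = "A") -> str:
    """Compute a unified diff from two full versions of a file."""
    lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="a/" + path,
        tofile="b/" + path,
    )
    return "".join(lines)

# Two full snapshots of file A, as they would be read back from storage.
old_version = "abc\n"
new_version = "def\n"

# The diff is computed only when asked for; nothing about it was stored.
print(diff_snapshots(old_version, new_version))
```

Because the diff is derived from the snapshots, any two versions can be compared directly, not only consecutive ones.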

Abizern answered Sep 19 '22 01:09
