
Remedial Lesson on Git Trees

Tags:

git

I've read and searched, searched and read, rinse, repeat, but a fundamental understanding of trees in Git continues to elude me (beyond the fact that they're loosely analogous to file system directories). They seem to be intrinsically linked to the index, but I just can't get the how through my thick skull.

Blobs are easy, of course, because they're a granular thing. Trees, at least conceptually, feel much more nebulous to me. Is there some way of explaining--in something approaching a remedial manner:

  1. How does Git detect that a tree needs to be created?
  2. What is stored beneath a tree at any given moment?
  3. Is a new tree "revision" created any time a blob beneath that tree is modified?

There may be other questions that I don't even know enough to ask, so feel free to elaborate in any way necessary to facilitate a coherent understanding of the object type and its context.

Much appreciated.

Rob Wilkerson asked Dec 16 '22 23:12


2 Answers

This can be a first description:

[image not preserved]
(source: eagain.net)

(From Git for Computer Scientists)

But Git From the Bottom Up will have the most detailed description.

the index
Unlike other, similar tools you may have used, Git does not commit changes directly from the working tree into the repository. Instead, changes are first registered in something called the index.
Think of it as a way of “confirming” your changes, one by one, before doing a commit (which records all your approved changes at once).
Some find it helpful to call it the “staging area” instead of the index.

working tree
A working tree is any directory on your filesystem which has a repository associated with it (typically indicated by the presence of a sub-directory within it named .git).
It includes all the files and sub-directories in that directory.

The difference between a Git blob and a filesystem’s file is that a blob stores no metadata about its content. All such information is kept in the tree that holds the blob.

One tree may know those contents as a file named “foo” that was created in August 2004, while another tree may know the same contents as a file named “bar” that was created five years later.
In a normal filesystem, two files with the same contents but with such different metadata would always be represented as two independent files.

Why this difference? Mainly, it’s because a filesystem is designed to support files that change, whereas Git is not.
The fact that data is immutable in the Git repository is what makes all of this work and so a different design was needed.
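
To make the "no metadata in the blob" point concrete, here is a minimal Ruby sketch. It assumes the usual object header ("blob", a space, the size in bytes, a NUL byte) in front of the content; the helper name blob_id and the two file names are made up for illustration and are not part of Git.

```ruby
require 'digest/sha1'

# Git hashes a blob as SHA-1 over "blob <size in bytes>\0" followed by the raw content.
# The file name never enters the computation.
def blob_id(content)
  body = content.b
  Digest::SHA1.hexdigest("blob #{body.bytesize}\0".b + body)
end

# Hypothetical working-tree files: two names, identical contents.
files = { 'foo' => "hello\n", 'bar' => "hello\n" }
files.each { |name, content| puts "#{name}: #{blob_id(content)}" }
# Both lines show the same identifier, so a single blob would serve both names.
```

The names "foo" and "bar" only show up in whichever tree lists the blob, which is why two trees can describe the same contents differently.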


In short, to quote Git Internals (a very short extract):

A tree is a simple list of trees and blobs that the tree contains, along with the names and modes of those trees and blobs.

More specifically, the content of a tree is:

a very simple text file that lists the:

  • mode,
  • type,
  • sha1 and
  • name

of each entry.

(Jakub Narębski details in the comments:

Actually the tree object is not a text file: for some reason it stores SHA-1 in binary format.

But:

The commit object uses textual format for SHA-1 of parents and of top tree.

)
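
A rough Ruby sketch of that difference, for anyone who wants to see the bytes. The helper names tree_entry and commit_body are invented here, and the author line is a placeholder rather than anything from this thread; only the two object ids are taken from the answers on this page.

```ruby
# One tree entry: "<mode> <name>\0" followed by the id as 20 raw bytes.
def tree_entry(mode, name, hex_id)
  "#{mode} #{name}\0".b + [hex_id].pack('H*')
end

# A commit body is plain text: ids appear as 40 hex characters.
# The author/committer line below is a made-up placeholder.
def commit_body(tree_hex, parent_hexes, message)
  who = 'A U Thor <author@example.com> 0 +0000'
  header = ["tree #{tree_hex}"] +
           parent_hexes.map { |parent| "parent #{parent}" } +
           ["author #{who}", "committer #{who}"]
  header.join("\n") + "\n\n" + message
end

entry = tree_entry('100644', 'a.txt', 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391')
puts entry.bytesize # 13 bytes of "mode name\0" plus 20 bytes of binary id
p entry             # inspect form, with the non-printable id bytes escaped
puts commit_body('ebf247ec5ebc97b12cd7a56db330141568edb946', [], "initial\n")
```

This is also why git cat-file -p is needed to read a tree: it pretty-prints the binary ids as hex, whereas a commit is readable as stored (after decompression).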


The OP adds in the comments:

What I think I'm having a hard time comprehending is that every commit has a tree.

It sure has. A commit is a pointer to a top-level tree, referenced by its SHA1.

And what triggers Git to create a tree initially?

Your first commit. (git init doesn't create a tree, just an empty Git repository.)

According to Pro Git, there's a tie-in to the index, but no more information is provided.

You must be referring to the internal objects chapter:

Git normally creates a tree by taking the state of your staging area or index and writing a tree object from it.

So, as soon as you 'git add' some files (i.e. "staging them", or "adding them to the index"), you allow Git to create a tree from the index on your next commit.

[image not preserved]
(source: progit.org)

This is essentially what Git does when you run the git add and git commit commands:

  • it stores blobs for the files that have changed,
  • updates the index,
  • writes out trees,
  • and writes commit objects that reference the top-level trees and the commits that came immediately before them.

These three main Git objects — the blob, the tree, and the commit — are initially stored as separate files in your .git/objects directory.

[image not preserved]
(source: progit.org)
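
As an illustration of "stored as separate files", here is a small Ruby sketch of where a loose object would land and what its bytes would be. It assumes the standard loose-object layout (zlib-compressed header plus content, filed under the first two hex digits of the id). The function name loose_object is hypothetical, and nothing is written to disk.

```ruby
require 'digest/sha1'
require 'zlib'

# Returns [relative path, compressed bytes] for a loose object, without touching disk.
def loose_object(type, content)
  body = content.b
  raw = "#{type} #{body.bytesize}\0".b + body
  id = Digest::SHA1.hexdigest(raw)
  # The first two hex digits name the directory, the remaining 38 name the file.
  path = File.join('.git', 'objects', id[0, 2], id[2..-1])
  [path, Zlib::Deflate.deflate(raw)]
end

path, packed = loose_object('blob', "hello\n")
puts path
p Zlib::Inflate.inflate(packed) # the header and content come back unchanged
```

"Initially" matters in the quote: Git may later pack many loose objects into a packfile, which this sketch does not cover.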

VonC answered Jan 03 '23 00:01


1. How does Git detect that a tree needs to be created?

When you commit, git builds a tree hierarchy for the contents of the index and then builds a commit referencing the root of that tree hierarchy. After the git-add operation, the repository contains blob objects for all of the files added, and the index contains references to the blobs paired with path names. There are no tree objects yet.

When you commit (technically, during the write-tree operation), git recursively constructs a set of trees using the index information. It starts with the trees that contain only blobs, determines their identifiers, and writes the tree objects. Then it goes up each level and constructs the next set of trees, since this cannot happen before the subtree identifiers are known. Then it stores the root-level tree.

A commit operation is broken down into the write-tree and commit-tree steps. The write-tree step uses the current state of the index to identify and (if necessary) store all of the trees. The commit-tree step creates a new commit referencing all of the parent commits and the root tree that was just created.
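
The write-tree step described above can be sketched in a few lines of Ruby. This is only an approximation of the idea, not Git's implementation: the names git_id, nest and write_tree (as a Ruby method) are made up, every file is assumed to be a regular non-executable file (mode 100644), and objects go into an in-memory hash rather than .git/objects.

```ruby
require 'digest/sha1'

# SHA-1 over "<type> <size>\0<body>", the header Git puts in front of every object.
def git_id(type, body)
  Digest::SHA1.hexdigest("#{type} #{body.bytesize}\0".b + body)
end

# Turn the flat index (path => blob id) into a nested hash of directories.
def nest(index)
  root = {}
  index.each do |path, blob|
    *dirs, file = path.split('/')
    dir = dirs.inject(root) { |node, name| node[name] ||= {} }
    dir[file] = blob
  end
  root
end

# Depth-first: a directory's id can only be computed once its children's ids are known.
# Returns the tree id and records every tree in `store` (id => serialized body).
def write_tree(node, store)
  entries = node.map do |name, child|
    if child.is_a?(Hash)
      ['40000', name, write_tree(child, store)]
    else
      ['100644', name, child] # assumption: plain files only
    end
  end
  # Git orders entries by name, comparing directories as if they ended in "/".
  entries.sort_by! { |mode, name, _| mode == '40000' ? "#{name}/" : name }
  body = entries.map { |mode, name, id| "#{mode} #{name}\0".b + [id].pack('H*') }.join
  id = git_id('tree', body)
  store[id] = body
  id
end

empty = 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'
index = { 'bar/a.txt' => empty, 'bar/b.txt' => empty,
          'foo/a.txt' => empty, 'foo/b.txt' => empty }
store = {}
puts write_tree(nest(index), store) # id of the root tree
puts store.size                     # 2: foo and bar serialize to the same bytes
```

The store ends up with two trees for three directories (root, foo, bar) because identical content yields an identical id, which is the "if necessary" in "identify and (if necessary) store".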

2. What is stored beneath a tree at any given moment?

When you learn how to use git, the main focus is on the directed acyclic graph (DAG) of commits: Each commit contains a pointer to the previous commit, and you can go back in time by following these links. This makes sense, since the user interface is about commits, and trees are really secondary.

The trees also form a DAG, but the difference is that they do not represent the history of commits. Just like a blob, once a tree is created, its identifier will forever point to that tree with those contents. If any blob or subtree listed in a tree is modified or removed, that entry's identifier changes (or the entry disappears), so the tree's own content changes and the tree gets a new identifier in the next commit.
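
A small Ruby sketch of how a change propagates upward, under the assumption that a tree id is the SHA-1 of its serialized entries. The helpers tree_id, dir_of and root_of are invented for this illustration; the two blob ids are the empty blob and the "hello" blob used in the worked example in this answer.

```ruby
require 'digest/sha1'

# Id of a tree whose entries are given as [mode, name, hex id], already in Git's order.
def tree_id(entries)
  body = entries.map { |mode, name, id| "#{mode} #{name}\0".b + [id].pack('H*') }.join
  Digest::SHA1.hexdigest("tree #{body.bytesize}\0".b + body)
end

EMPTY = 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391' # the empty blob
HELLO = 'ce013625030ba8dba906f756967f9e9ca394464a' # the blob for "hello\n"

# A directory holding a.txt and b.txt.
def dir_of(a_blob, b_blob)
  tree_id([['100644', 'a.txt', a_blob], ['100644', 'b.txt', b_blob]])
end

# A root holding the two directories bar and foo.
def root_of(bar, foo)
  tree_id([['40000', 'bar', bar], ['40000', 'foo', foo]])
end

foo        = dir_of(EMPTY, EMPTY)
bar_before = dir_of(EMPTY, EMPTY)
bar_after  = dir_of(EMPTY, HELLO) # bar/b.txt was modified

puts "foo and old bar share an id: #{foo == bar_before}"
puts "bar got a new id:            #{bar_before != bar_after}"
puts "root got a new id:           #{root_of(bar_before, foo) != root_of(bar_after, foo)}"
```

All three lines print true: only the trees on the path from the changed blob up to the root receive new identifiers, and everything else is reused.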

3. Is a new tree "revision" created any time a blob beneath that tree is modified?

Ok, let's say your repository looks like this:

foo/
  a.txt
  b.txt
bar/
  a.txt
  b.txt

and all of the files are empty. Then there are three objects in the repository, not counting the commit:

  1. The top-level tree:

    $ git cat-file -p ebf247ec5ebc97b12cd7a56db330141568edb946
    040000 tree 2bdf04adb23d2b40b6085efb230856e5e2a775b7    bar
    040000 tree 2bdf04adb23d2b40b6085efb230856e5e2a775b7    foo
    
  2. A tree with two blobs:

    $ git cat-file -p 2bdf04adb23d2b40b6085efb230856e5e2a775b7
    100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    a.txt
    100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    b.txt
    
  3. The empty blob:

    $ git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    

First I'll explain why the trees foo and bar are stored by the same object, then I'll make a change and see what happens.

The SHA1 identifier of a tree is determined entirely by its content, just like a blob. Note that the tree's own name is not part of that content; the name is recorded in the parent tree. Renaming a tree therefore creates a new parent, but the tree itself does not need to be stored again. If you paste the cat-file output for the two-blob tree above to git mktree, git will respond with the object name of the resulting tree. Under the hood, mktree produces the SHA1 like this Ruby code:

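require 'digest/sha1'

# A tree entry stores the SHA1 as 20 raw bytes, not as 40 hex characters.
sha1 = ['e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'].pack('H*')
# Each entry is "<mode> <name>\0<raw sha1>"; entries follow each other with no separator.
contents = "100644 a.txt\0#{sha1}100644 b.txt\0#{sha1}"
# The object header is "tree <size in bytes>\0".
data = "tree #{contents.bytesize}\0#{contents}"
puts Digest::SHA1.hexdigest(data)
# prints 2bdf04adb23d2b40b6085efb230856e5e2a775b7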

Now I'm going to modify 'bar/b.txt' and examine the new set of trees:

$ echo hello > bar/b.txt
$ git add bar/b.txt
$ git write-tree
5fa578acc6695bf2af2975ed0ffa7ab448b52c22
$ git cat-file -p 5fa578acc6695bf2af2975ed0ffa7ab448b52c22
040000 tree 9a514e08691a9f636665a43a1c89dc1920dab0fa    bar
040000 tree 2bdf04adb23d2b40b6085efb230856e5e2a775b7    foo

Since nothing underneath 'foo' changed, it is stored as the exact same tree. For large structures, this is a huge space win. There is a new tree for 'bar', since I modified it:

$ git cat-file -p 9a514e08691a9f636665a43a1c89dc1920dab0fa
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    a.txt
100644 blob ce013625030ba8dba906f756967f9e9ca394464a    b.txt
$ git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
hello

Again, nothing in the tree objects says anything about revisions or commits. If a tree and its children are unchanged from one commit to the next, they will be represented by the same object. If there are two identical trees in the same commit, they will also be represented by the same object.

Regarding the index, there is only a minimal link between it and the trees. One important distinction is that the index stores blob identifiers with their full paths, uses a flat list, and does not mention trees at all:

$ git ls-files -s
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       bar/a.txt
100644 ce013625030ba8dba906f756967f9e9ca394464a 0       bar/b.txt
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       foo/a.txt
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       foo/b.txt

When data is copied from a tree to the index, the tree structure is flattened. When data is copied from the index to the trees, it is rebuilt.
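
Both directions can be sketched in Ruby with nested hashes standing in for trees. This is a simplified model, not how Git stores either structure: the names flatten_tree, index_entries and rebuild are hypothetical, and modes and stage numbers are left out.

```ruby
# Tree -> index: walk the nested structure and emit [full path, blob id] pairs.
def flatten_tree(tree, prefix = '')
  tree.flat_map do |name, child|
    path = prefix + name
    child.is_a?(Hash) ? flatten_tree(child, "#{path}/") : [[path, child]]
  end
end

# The index keeps its entries ordered by full path.
def index_entries(tree)
  flatten_tree(tree).sort_by { |path, _| path }
end

# Index -> tree: split each path and recreate the directory levels.
def rebuild(entries)
  root = {}
  entries.each do |path, blob|
    *dirs, file = path.split('/')
    dirs.inject(root) { |node, name| node[name] ||= {} }[file] = blob
  end
  root
end

empty = 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'
hello = 'ce013625030ba8dba906f756967f9e9ca394464a'
tree = { 'foo' => { 'a.txt' => empty, 'b.txt' => empty },
         'bar' => { 'a.txt' => empty, 'b.txt' => hello } }

index_entries(tree).each { |path, blob| puts "#{blob} #{path}" }
puts rebuild(index_entries(tree)) == tree
```

The printed paths come out in the same order as the ls-files listing above, and the final comparison shows that flattening followed by rebuilding gives back the original structure.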

References

  • Dulwich Tutorial
  • Git Magic
  • Pro Git - Internals

Josh Lee answered Jan 02 '23 23:01