
Where are tree hashes stored in git?

Tags: git, hash

I am following this tutorial (https://jwiegley.github.io/git-from-the-bottom-up/1-Repository/3-blobs-are-stored-in-trees.html) to learn about git architecture.

The command

$ git cat-file commit HEAD

gives me the hash for the tree referenced by HEAD, "0563f77d884e4f79ce95117e2d686d7d6e282887". Now, I try to find this hash in .git:

$ find .git/ | xargs grep "0563f77"

Why does nothing come up? Is this hash not stored anywhere?

Joshua Meyers asked Jul 04 '16 04:07



1 Answer

I think you're mixing together several concepts here:

  • Git's internal object names (SHA-1 hashes) are determined entirely by the contents of the object, and hence in a philosophical sense are those contents. More precisely, an object's name is the SHA-1 hash of the object's type name as a string (commit, blob, tree, or tag), followed by a space, then the length of the object in bytes written in decimal, then a NUL (zero) byte, then the raw data of the object itself.

    Note that if you hash the same object twice, you get the same hash both times. Thus, if a file named README.txt has some text in it, and you copy that file to read-me-too.txt and hash the copy, you get the same hash again. This is because the name of the file is not part of the input to the hash computation: only the type (in this case blob), the space, the size, the zero byte, and the contents are.

    If the two files contain just one line reading hello (plus a newline, for six bytes total), the input to the hash function is blob 6\0hello\n (where \0 and \n stand for the zero-byte and the newline). In fact, the hash of these two files is ce013625030ba8dba906f756967f9e9ca394464a. (I used git hash-object to find this value, although any SHA-1 code will do the trick: you can find this with a few lines of Python or Ruby code, or a fair number of lines of C code, for instance. Hashing trees is trickier.)

    The object ID ce013625030ba8dba906f756967f9e9ca394464a represents a file containing the word hello followed by a newline.[1] (If we know what data a file contains, we can hash the data and find the Git object ID. Normally we go the other way: we start from a valid Git object ID, and we retrieve the data from the repository. But when we git add a file, we go this way, turning the data into a hash and storing it as a Git object, if it is not already in the repository. If it is already in, we're all good: we just use the same hash again.)

  • The object itself—the object's data—is stored somewhere in the Git repository.

    The location you found, where object 0563f77d884e4f79ce95117e2d686d7d6e282887 is in a directory named 05 with a file whose name starts with 63f77 and continues on with the rest of the hash, is where Git currently keeps what it calls loose objects. However, Git also packs objects into what it calls pack files.

    The format of pack files is rather complicated, and would take too long to go into here. We can, however, say that a single pack file can store tens of thousands of objects. (Pack file formats have been revised several times to improve performance and individual object accessibility.)

  • We need a method to convert from human-readable names, like branch names, to Git hashes. This is what you found in the search you noted in a comment:

    It worked for the commit hash, returned by $ git rev-parse HEAD. This hash is stored in .git/refs/heads/master [and two reflogs]

    Git's design offers two particularly well-distinguished external name forms, specifically branch names and tag names, with which we can remember particular commit hashes. Git's general term for this is references. Git's remote-tracking branches are references as well, stored under refs/remotes/. Besides these branch and tag names, you are likely to encounter notes and "the stash" (git stash): these also use references, specifically those in refs/notes/ and the (single) name refs/stash respectively.

    As with objects, reference values are stored somewhere in the Git repository, but you are not promised that they remain in individual files. As of today (Git version 2.9) they are always in either individual files like the one you found, or in a single special file named packed-refs (or occasionally in both: in this case the individual file has the correct value, if the two disagree).
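The hashing rule from the first bullet, and the loose-object location from the second, can be sketched in a few lines of Python. This is only an illustration: the function names git_object_id and loose_object_path are made up here, and the path is just where a loose copy would live if the object has not been packed.

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to `data` stored as `obj_type`."""
    # Header: type name, a space, the length in decimal, then a NUL byte.
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Where a loose copy of the object would live (hypothetical helper)."""
    # First two hex digits name the directory, the remaining 38 name the file.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


# The file name never enters the computation, so README.txt and
# read-me-too.txt with identical contents share one object ID.
oid = git_object_id("blob", b"hello\n")
print(oid)  # ce013625030ba8dba906f756967f9e9ca394464a
print(loose_object_path(oid))
```

The printed ID matches what git hash-object reports for a file containing hello plus a newline.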

A branch name is just a reference that starts with refs/heads/.[2] A tag is a name that starts with refs/tags/.[3] Either one will let you find the SHA-1 hash of a commit. The key difference between the two is that a branch name is expected to change over time, pointing to the newest commit on the branch; but a tag name should point to the same commit forever.

In fact, not only is a branch name expected to change, Git will automatically change it for you. In particular, if git status says that you are on branch master, and you make a new commit, Git will change refs/heads/master to point to the new commit. Git also makes the new commit have, as its parent commit ID, the commit master pointed to just before you made the new commit. This is how a branch grows: the reference always points to the tip-most commit, by definition. That tip-most commit points, through its parent ID, to an earlier commit, which points further back in history, and so on. (And if a commit is a merge commit, it has two, or maybe even three or more, parent IDs instead of just one.)
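The reference lookup described earlier (an individual file if there is one, otherwise the packed-refs file) can be sketched roughly as follows. The helper name resolve_ref and the throwaway demo directory are assumptions made for illustration, and the hashes are dummy values. Since Git does not promise this storage layout, real scripts should ask git rev-parse instead of reading these files.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_ref(git_dir: Path, name: str) -> Optional[str]:
    """Resolve a name such as 'HEAD' or 'refs/heads/master' to an object ID."""
    loose = git_dir / name
    if loose.is_file():
        value = loose.read_text().strip()
        # A symbolic ref (typically HEAD) holds "ref: <name>" instead of a hash.
        if value.startswith("ref: "):
            return resolve_ref(git_dir, value[len("ref: "):])
        return value  # the individual file wins over packed-refs
    packed = git_dir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            # Skip the header comment and "^<hash>" peeled-tag lines.
            if not line or line[0] in "#^":
                continue
            object_id, _, ref_name = line.partition(" ")
            if ref_name == name:
                return object_id
    return None


# Demo on a fake, minimal stand-in for a .git directory.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    (git_dir / "refs" / "heads" / "master").write_text("a" * 40 + "\n")
    (git_dir / "packed-refs").write_text(
        "b" * 40 + " refs/heads/master\n" + "c" * 40 + " refs/tags/v1\n"
    )
    print(resolve_ref(git_dir, "HEAD"))          # follows HEAD to the loose file
    print(resolve_ref(git_dir, "refs/tags/v1"))  # found only in packed-refs
```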

What this means is that a key place you will find these Git object IDs is inside other Git objects.

This is what you see when you pretty-print a commit (with git cat-file -p HEAD or git cat-file commit HEAD, both of which do the same thing): you view the contents of the tip commit of the current branch, and you see tree <ugly-sha-1>. So the tree ID is stored in the commit. If the commit is itself in a loose object, however, and you bring up .git/objects/05/... in a file editor or viewer, you won't see that hash, or even the word tree. This is because the repository data is compressed (specifically, with zlib; objects stored in pack files are compressed differently, using a modified version of xdelta, and then also zlib-deflated). This is also why you can and should use something like git cat-file to view the object's contents: that insulates you from the location and format details. All you need is the object's ID; git cat-file will find and decompress the object.
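To see why grep comes up empty, here is a small sketch that builds a commit-shaped object in memory, deflates it the way a loose object is stored, and searches both forms. The tree hash is the one from the question; the author and committer lines and the message are invented for the example, so this is not the asker's actual commit.

```python
import zlib

TREE_ID = "0563f77d884e4f79ce95117e2d686d7d6e282887"  # tree hash from the question

# Hypothetical commit body: identity lines and message are made up.
body = (
    f"tree {TREE_ID}\n"
    "author A U Thor <author@example.com> 1467605220 +0000\n"
    "committer A U Thor <author@example.com> 1467605220 +0000\n"
    "\n"
    "Initial commit\n"
).encode("ascii")

# A loose object is "<type> <size>\0<body>", zlib-deflated on disk.
raw = b"commit " + str(len(body)).encode("ascii") + b"\0" + body
on_disk = zlib.compress(raw)

# grep only ever sees the compressed bytes.
print("found in compressed bytes:", TREE_ID[:7].encode() in on_disk)

# Inflating the data recovers the header and the readable commit text.
inflated = zlib.decompress(on_disk)
header, _, content = inflated.partition(b"\0")
print(header.decode())
print(content.decode().splitlines()[0])
```

This is roughly what git cat-file does on your behalf for a loose object; for a packed object it has to do rather more work.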

Tree objects themselves contain additional Git object IDs, as you can see by using git cat-file -p on a tree:

$ git cat-file -p 'HEAD^{tree}'
[snip]
100644 blob cb2ca2bb2e86aa4a4c3c9b08490c72b04a1778d3    rfuncs.h
040000 tree 05006c6f2e6119fede241cf6ec845291a5be665e    sbuf
[snip more]

Thus, one particular Git blob object (cb2ca2b...) and one additional Git tree object (05006c6...) have their Git-object-names saved away inside the tree associated with the HEAD commit.
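One more wrinkle for anyone grepping: inside a tree object, each entry's ID is stored as 20 raw bytes rather than 40 hex characters, so the text form never appears even after decompression. The sketch below serializes just the two entries shown above (the real tree has more, which were snipped) and parses them back. The function names build_tree and parse_tree are made up for this example.

```python
def build_tree(entries):
    """Serialize (mode, name, hex_id) tuples into a raw tree body."""
    out = b""
    for mode, name, hex_id in entries:
        # Each entry: "<mode> <name>\0" followed by the 20-byte binary ID.
        out += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    return out


def parse_tree(body):
    """Yield (mode, type, hex_id, name) for each entry of a raw tree body."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        raw_id = body[nul + 1:nul + 21]  # 20 raw bytes, not hex text
        if mode == "40000":
            kind = "tree"
        elif mode == "160000":
            kind = "commit"  # submodule entry
        else:
            kind = "blob"
        # Directory modes are stored as "40000"; cat-file -p pads to six digits.
        yield mode.zfill(6), kind, raw_id.hex(), name
        pos = nul + 21


body = build_tree([
    ("100644", "rfuncs.h", "cb2ca2bb2e86aa4a4c3c9b08490c72b04a1778d3"),
    ("40000", "sbuf", "05006c6f2e6119fede241cf6ec845291a5be665e"),
])
for mode, kind, hex_id, name in parse_tree(body):
    print(f"{mode} {kind} {hex_id}\t{name}")
print("hex text present in raw tree:", b"cb2ca2b" in body)
```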


[1] The Pigeonhole Principle tells us that if we hash enough different objects, we will get ce013625030ba8dba906f756967f9e9ca394464a for at least two different files. On that day, Git breaks. :-) It takes a huge number of inputs to get a hash collision, though. Probability mathematics suggests that you will lose data on thousands of disk drives long before you get a Git hash collision, even if you have billions of files. In fact, it takes about 1.71 quadrillion files to raise the probability of a hash collision to 10^-18 (one in 10^18), which is a typical quoted error rate for enterprise-grade storage media.

Of course, these assume random chance inputs, rather than maliciously constructed files using cryptography theory to attempt to break Git.
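The 1.71 quadrillion figure can be checked with the usual birthday-bound approximation, p ≈ n² / (2 · 2^160), which holds while p is small and the hashes behave like random 160-bit values. The approximation is my assumption about how the number was derived, and the function name below is made up.

```python
HASH_BITS = 160  # SHA-1 produces 160-bit digests


def collision_probability(n_objects: float, bits: int = HASH_BITS) -> float:
    """Birthday-bound approximation n^2 / (2 * 2^bits), valid while small."""
    return n_objects ** 2 / (2 * 2 ** bits)


# About 1.71 quadrillion distinct objects gives roughly 1e-18.
print(collision_probability(1.71e15))
```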

[2] It's no coincidence that you found the file master inside refs/heads. Someday, though, Git may no longer store names in flat files since this imposes file-system restrictions on branch naming: in particular it makes it impossible to have both a branch named x and a branch named x/y. Note that when the references are in .git/packed-refs, it is possible to have both x and x/y, at least in an information-theoretic sense. It's merely an annoying file system restriction that you cannot have a file named x and a directory also named x containing a file named y. (There's no particularly good reason for this file system restriction either, except that POSIX requires it.)

[3] If a tag is an annotated tag, it refers to a Git object of type "tag", which then points to the next object. In fact, this is the definition of an annotated tag name: it's a name in refs/tags/ that points to an annotated tag object. The tag object usually points directly to a commit, although you can tag a tag object, instead of tagging a commit directly, and then have to peel both tag layers off to get to the underlying commit.

Git will let you point a tag (lightweight or annotated) to any Git object, but will normally only let you point a branch name to a commit object.

torek answered Sep 28 '22 02:09
