What is the format of a Git tree object's content?
The content of a blob object is blob [size of string] NUL [string], but what is it for a tree object?
A "tree" in Git is an object (a file, really) which contains a list of pointers to blobs or other trees. Each entry in the tree object contains a pointer (the object's hash) to one such object (tree or blob), along with the mode and a name for the file or directory. The object type is not stored separately; it follows from the mode.
A Git tree object creates the hierarchy between files in a Git repository. You can use the Git tree object to create the relationship between directories and the files they contain.
Git places only four types of objects in the object store: the blobs, trees, commits, and tags. These four atomic objects form the foundation of Git's higher level data structures. Each version of a file is represented as a blob.
The format of a tree object:
tree [content size]\0[Entries having references to other trees and blobs]
The format of each entry having references to other trees and blobs:
[mode] [file/folder name]\0[SHA-1 of referencing blob or tree]
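The two formats above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names tree_entry and tree_object are made up here, the two entries are borrowed from the sample output in this answer, and the entries are assumed to be passed in the order Git already uses (no sorting is done).

```python
def tree_entry(mode: str, name: str, sha1_hex: str) -> bytes:
    """Encode one entry: '[mode] [name]', a NUL byte, then the 20 raw SHA-1 bytes."""
    return (mode.encode("ascii") + b" " + name.encode("utf-8")
            + b"\0" + bytes.fromhex(sha1_hex))


def tree_object(entries) -> bytes:
    """Prefix the concatenated entries with the 'tree [content size]' header and a NUL."""
    body = b"".join(tree_entry(*entry) for entry in entries)
    return b"tree " + str(len(body)).encode("ascii") + b"\0" + body


# Hypothetical two-entry tree; entries are assumed to be already sorted.
entries = [
    ("100644", "pom.xml", "97e5b6b292d248869780d7b0c65834bfb645e32a"),
    ("40000", "src", "6e63db37acba41266493ba8fb68c76f83f1bc9dd"),
]
obj = tree_object(entries)
print(obj[:8])   # b'tree 65\x00'
print(len(obj))  # 73: 8 header bytes plus 35 + 30 content bytes
```

The content size in the header counts only the entries, not the header itself.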
I wrote a script that inflates (decompresses) tree objects. Its output looks like this:
tree 192\0
40000 octopus-admin\0 a84943494657751ce187be401d6bf59ef7a2583c
40000 octopus-deployment\0 14f589a30cf4bd0ce2d7103aa7186abe0167427f
40000 octopus-product\0 ec559319a263bc7b476e5f01dd2578f255d734fd
100644 pom.xml\0 97e5b6b292d248869780d7b0c65834bfb645e32a
40000 src\0 6e63db37acba41266493ba8fb68c76f83f1bc9dd
A mode starting with 10 (such as 100644) shows that the entry references a blob, i.e. a file, while 40000 marks a tree. In the example above, pom.xml is a blob and the others are trees.
Note that I added new lines and spaces after \0 for the sake of pretty printing. Normally the content has no line breaks at all. I also converted the 20 bytes (i.e. the SHA-1 of each referenced blob or tree) into a hex string to make them easier to read.
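The original script is not shown, so the following is only a guess at what such a pretty printer might look like in Python. The function name parse_tree and the one-entry SAMPLE object are assumptions made for illustration; the input is expected to be an already inflated tree object, header included.

```python
def parse_tree(obj: bytes):
    """Split an inflated tree object into (mode, name, sha1_hex) tuples."""
    header, sep, body = obj.partition(b"\0")        # header ends at the first NUL
    kind, _, size = header.partition(b" ")
    if not sep or kind != b"tree" or int(size.decode("ascii")) != len(body):
        raise ValueError("not a well-formed tree object")
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)               # end of the mode
        nul = body.index(b"\0", space)              # end of the name
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8", "replace")
        sha1_hex = body[nul + 1:nul + 21].hex()     # 20 raw bytes -> 40 hex digits
        entries.append((mode, name, sha1_hex))
        pos = nul + 21
    return entries


# Hypothetical one-entry object built from the 'src' entry shown above.
SAMPLE = (b"tree 30\0" + b"40000 src\0"
          + bytes.fromhex("6e63db37acba41266493ba8fb68c76f83f1bc9dd"))

for mode, name, sha1_hex in parse_tree(SAMPLE):
    print(f"{mode} {name}\\0 {sha1_hex}")
```

Because the hash is a fixed 20 bytes, the parser needs no delimiter after it; the next entry starts right away.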
I'll try to elaborate a bit more on @lemiorhan's answer by means of a test repo.
Create a test project in an empty folder:
$ echo ciao > file1
$ mkdir folder1
$ echo hello > folder1/file2
$ echo hola > folder1/file3
That is:
$ find -type f
./file1
./folder1/file2
./folder1/file3
Create the local Git repo:
$ git init
$ git add .
$ git write-tree
0b6e66b04bc1448ca594f143a91ec458667f420e
The last command returns the hash of the top level tree.
To print the content of a tree in human readable format use:
$ git ls-tree 0b6e66
100644 blob 887ae9333d92a1d72400c210546e28baa1050e44    file1
040000 tree ab39965d17996be2116fe508faaf9269e903c85b    folder1
In this case 0b6e66 is the first six characters of the top-level tree's hash. You can do the same for folder1.
To get the same content but in raw format use:
$ git cat-file tree 0b6e66
100644 file1 ▒z▒3=▒▒▒$ ▒►Tn(▒▒♣D40000 folder1 ▒9▒]▒k▒◄o▒▒▒i▒♥▒[%
The content is similar to the one physically stored as a file in compressed format, but it misses the initial string:
tree [content size]\0
To get the actual content, we need to decompress the file storing the 0b6e66 tree object. Given the 2/38 path format (the first two hex characters of the hash name the directory, the remaining 38 name the file), the file we want is:
.git/objects/0b/6e66b04bc1448ca594f143a91ec458667f420e
This file is compressed with zlib, therefore we obtain its content with:
$ openssl zlib -d -in .git/objects/0b/6e66b04bc1448ca594f143a91ec458667f420e
tree 67 100644 file1 ▒z▒3=▒▒▒$ ▒►Tn(▒▒♣D40000 folder1 ▒9▒]▒k▒◄o▒▒▒i▒♥▒[%
We learn the tree content size is 67.
Note that, since the terminal is not made for printing binary data, it might swallow part of the string or show other weird behaviour. In that case, pipe the commands above through od -c, or use the manual solution in the next section.
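If openssl on your machine was built without zlib support, the same inflation step can be sketched with Python's standard zlib module. The helper names loose_object_path and read_loose_object are invented for this sketch, and the demo deliberately works on a throwaway directory holding a hand-made empty tree, so that it runs without a real repository.

```python
import os
import tempfile
import zlib


def loose_object_path(git_dir: str, oid: str) -> str:
    """Map a full 40-character object ID to its 2/38 path under objects/."""
    return os.path.join(git_dir, "objects", oid[:2], oid[2:])


def read_loose_object(git_dir: str, oid: str) -> bytes:
    """Return the inflated object, header included."""
    with open(loose_object_path(git_dir, oid), "rb") as fh:
        return zlib.decompress(fh.read())


# Demo: store an empty tree the way a loose object is stored, then read it back.
EMPTY_TREE_ID = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
with tempfile.TemporaryDirectory() as git_dir:
    path = loose_object_path(git_dir, EMPTY_TREE_ID)
    os.makedirs(os.path.dirname(path))
    with open(path, "wb") as fh:
        fh.write(zlib.compress(b"tree 0\0"))
    inflated = read_loose_object(git_dir, EMPTY_TREE_ID)

print(inflated)  # b'tree 0\x00'
```

For the test repo in this answer the call would be read_loose_object(".git", "0b6e66b04bc1448ca594f143a91ec458667f420e"); note that the full hash is needed, not the six-character abbreviation. Printing a bytes value shows non-printable bytes as \x## escapes, which sidesteps the terminal problem.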
To understand the tree generation process we can generate it ourselves starting from its human readable content, e.g. for the top tree:
$ git ls-tree 0b6e66
100644 blob 887ae9333d92a1d72400c210546e28baa1050e44    file1
040000 tree ab39965d17996be2116fe508faaf9269e903c85b    folder1
Each object's ASCII (hexadecimal) SHA-1 hash is converted to binary and stored as 20 raw bytes. If all you need is a binary version of the ASCII hashes, you can get it with:
$ echo -e "$(echo ASCIIHASH | sed -e 's/../\\x&/g')"
So the blob 887ae9333d92a1d72400c210546e28baa1050e44 is converted to:
$ echo -e "$(echo 887ae9333d92a1d72400c210546e28baa1050e44 | sed -e 's/../\\x&/g')"
▒z▒3=▒▒▒$ ▒►Tn(▒▒♣D
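For comparison, the same hex-to-binary conversion can be sketched in Python with bytes.fromhex. The variable names are just illustrative.

```python
ascii_hash = "887ae9333d92a1d72400c210546e28baa1050e44"

# Every pair of hex digits becomes one byte, so 40 characters shrink to 20 bytes.
raw = bytes.fromhex(ascii_hash)

print(len(raw))                 # 20
print(raw)                      # non-printable bytes are shown as \x## escapes
print(raw.hex() == ascii_hash)  # True: the conversion is reversible
```

The printable bytes (z, 3, =, $, T, n, (, D) are the same characters that show through in the terminal output above.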
If we want to create the whole tree object, here is an awk one-liner:
$ git ls-tree 0b6e66 | awk -b 'function bsha(asha)\
{n=patsplit(asha, x, /../); h=""; for(j=1; j<=n; j++) h=h sprintf("%c", strtonum("0x" x[j])); return(h)}\
{t=t sprintf("%d %s\0%s", $1, $4, bsha($3))} END {printf("tree %s\0%s", length(t), t)}'
tree 67 100644 file1 ▒z▒3=▒▒▒$ ▒►Tn(▒▒♣D40000 folder1 ▒9▒]▒k▒◄o▒▒▒i▒♥▒[%
The function bsha converts the ASCII SHA-1 hashes to binary. The tree content is first accumulated in the variable t; its length is then calculated and printed, together with the content, in the END{...} section.
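As a cross-check of the one-liner, here is a rough Python equivalent. It is a sketch under a few assumptions: the function name tree_from_ls_tree is made up, the listing is hard-coded rather than read from git, and each line is assumed to have the plain "mode type hash TAB name" layout with unquoted names.

```python
def tree_from_ls_tree(listing: str) -> bytes:
    """Rebuild the raw tree object from 'git ls-tree' text output."""
    body = b""
    for line in listing.splitlines():
        if not line.strip():
            continue
        meta, name = line.split("\t", 1)        # the name follows a TAB
        mode, _type, sha1_hex = meta.split()
        body += (mode.lstrip("0").encode("ascii")   # 040000 is stored as 40000
                 + b" " + name.encode("utf-8")
                 + b"\0" + bytes.fromhex(sha1_hex))
    return b"tree " + str(len(body)).encode("ascii") + b"\0" + body


# The listing of the top-level tree from this answer, with explicit TABs.
LISTING = (
    "100644 blob 887ae9333d92a1d72400c210546e28baa1050e44\tfile1\n"
    "040000 tree ab39965d17996be2116fe508faaf9269e903c85b\tfolder1\n"
)

tree = tree_from_ls_tree(LISTING)
print(tree[:8])  # b'tree 67\x00'
```

The size works out as 33 bytes for the file1 entry (6 + 1 + 5 + 1 + 20) plus 34 bytes for the folder1 entry (5 + 1 + 7 + 1 + 20), i.e. 67.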
As observed above, the console is not well suited to printing binary data, so we might want to replace each byte of the hashes with its \x## equivalent:
$ git ls-tree 0b6e66 | awk -b 'function bsha(asha)\
{n=patsplit(asha, x, /../); h=""; for(j=1; j<=n; j++) h=h sprintf("%s", "\\x" x[j]); return(h)}\
{t=t sprintf("%d %s\0%s", $1, $4, bsha($3))} END {printf("tree %s\0%s", length(t), t)}'
tree 187 100644 file1 \x88\x7a\xe9\x33\x3d\x92\xa1\xd7\x24\x00\xc2\x10\x54\x6e\x28\xba\xa1\x05\x0e\x4440000 folder1 \xab\x39\x96\x5d\x17\x99\x6b\xe2\x11\x6f\xe5\x08\xfa\xaf\x92\x69\xe9\x03\xc8\x5b%
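The output is a good compromise for understanding the tree content structure. Note that the size in the header is now 187 rather than 67, because length(t) counts each four-character \x## escape instead of a single byte. Compare the output above with the general tree content structure: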
tree [content size]\0[Object Entries]
where each Object Entry is like:
[mode] [Object name]\0[SHA-1 in binary format]
Modes are a subset of UNIX filesystem modes. See the Tree Objects section of the Git manual for more details.
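To make the meaning of the modes concrete, here is a small Python sketch that classifies them using the standard stat module. The function name describe_mode and the wording of the labels are my own; the mode values covered (100644, 100755, 120000, 40000 and 160000) are the ones Git normally writes.

```python
import stat


def describe_mode(mode: str) -> str:
    """Classify a tree-entry mode given as an octal string such as '100644'."""
    bits = int(mode, 8)
    if stat.S_ISDIR(bits):
        return "tree (directory)"
    if stat.S_ISLNK(bits):
        return "blob (symbolic link)"
    if bits & 0o170000 == 0o160000:          # file-type bits used for gitlinks
        return "commit (submodule)"
    if stat.S_ISREG(bits):
        return "blob (executable file)" if bits & 0o111 else "blob (regular file)"
    raise ValueError(f"unexpected mode: {mode}")


for mode in ("100644", "100755", "120000", "40000", "160000"):
    print(mode, "->", describe_mode(mode))
```

A leading zero makes no difference here, so describe_mode("040000") and describe_mode("40000") give the same answer.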
We need to make sure that the results are consistent. To this end, we might compare the checksum of the awk generated tree with the checksum of the Git stored tree.
As for the latter:
$ openssl zlib -d -in .git/objects/0b/6e66b04bc1448ca594f143a91ec458667f420e | shasum
0b6e66b04bc1448ca594f143a91ec458667f420e *-
As for the homemade tree:
$ git ls-tree 0b6e66 | awk -b 'function bsha(asha)\
{n=patsplit(asha, x, /../); h=""; for(j=1; j<=n; j++) h=h sprintf("%c", strtonum("0x" x[j])); return(h)}\
{t=t sprintf("%d %s\0%s", $1, $4, bsha($3))} END {printf("tree %s\0%s", length(t), t)}' | shasum
0b6e66b04bc1448ca594f143a91ec458667f420e *-
The checksum is the same.
The more or less official way to get the tree's SHA-1 is:
$ git ls-tree 0b6e66 | git mktree
0b6e66b04bc1448ca594f143a91ec458667f420e
To calculate it manually, we pipe the content of the script-generated tree into the shasum command. We have actually already done this above (to compare the generated and stored content). The result was:
0b6e66b04bc1448ca594f143a91ec458667f420e *-
which is the same as with git mktree.
You might find that, for your repo, you are unable to find the files .git/objects/XX/XXX... storing the Git objects. This happens because some or all "loose" objects have been packed into one or more .git/objects/pack/*.pack files.
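A quick way to tell which situation you are in can be sketched in Python. The function name object_storage and its three return labels are invented for this example, and the check is only a heuristic: it looks for the loose file and for the mere presence of pack files, without opening the packs.

```python
import glob
import os
import tempfile


def object_storage(git_dir: str, oid: str) -> str:
    """Guess where an object lives: as a loose file, or possibly inside a pack."""
    loose = os.path.join(git_dir, "objects", oid[:2], oid[2:])
    if os.path.isfile(loose):
        return "loose"
    packs = glob.glob(os.path.join(git_dir, "objects", "pack", "*.pack"))
    return "maybe packed" if packs else "missing"


# Demo on a throwaway directory; "pack-demo.pack" is an empty placeholder file.
oid = "0b6e66b04bc1448ca594f143a91ec458667f420e"
with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "objects", "pack"))
    before = object_storage(git_dir, oid)
    open(os.path.join(git_dir, "objects", "pack", "pack-demo.pack"), "wb").close()
    after = object_storage(git_dir, oid)

print(before, "->", after)  # missing -> maybe packed
```

Commands such as git cat-file and git ls-tree read objects wherever they are stored, so unpacking is only needed when you want to inspect the files under .git/objects by hand.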
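To unpack the repo, first move the pack files away from their original position, then run git unpack-objects on each of them. The move is needed because git unpack-objects skips any object that already exists in the repository, which includes objects still sitting in a pack.
$ mkdir .git/pcache
$ mv .git/objects/pack/*.pack .git/pcache/
$ for p in .git/pcache/*.pack; do git unpack-objects < "$p"; done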
To repack when you are done with experiments:
$ git gc