Context: I was hoping to be able to search through my git commit messages and commits without having to go through the puzzlingly complex git grep command, so I decided to see how git commit messages were stored.
I took a look in a .git folder, and it looks to me like commits are stored in .git/objects.
The .git/objects folder contains a bunch of folders with names like a6 and 9b. These folders each contain a file with a name that looks like a commit sha, e.g. 2f29598814b07fea915514cfc4d05129967bf7. When I open one of those files in a text editor, I get gibberish.
In this git commit log, the folder 9b contains one commit sha: aed8a9f773efb2f498f19c31f8603b6cb2a4bc
Why, and is there a case where more than one commit sha would be stored in the folder 9b?
Is there a way to convert this gibberish into plain text so that I can mess with commits in a text editor?
Repository Structure
There are four major types of Git objects: blobs, trees, commits, and tags. The names of these objects are all SHA-1 hashes. You can use the git cat-file -t command to view the type of the object behind a given SHA-1, and the git cat-file -p command to view its contents and simple data structure.
Underneath the hood, git has a concept of objects. Objects are generally made up of a header plus some data. File content gets stored as a blob object. Tree objects contain filenames and point to blob objects that represent the files, and tree objects that represent other directories.
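As a rough illustration of "header plus some data", the sketch below computes the name Git would give to a blob. It assumes a repository using the default SHA-1 object format; the function name git_object_id is made up for this example and is not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """Return the SHA-1 name of an object: hash of header + payload.

    The header is "<type> <size in bytes>" followed by a NUL byte.
    """
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


# The empty blob has a well-known id in SHA-1 repositories.
print(git_object_id("blob", b""))
```

The printed value can be compared with what git hash-object reports for an empty file.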
Commits are the core building block units of a Git project timeline. Commits can be thought of as snapshots or milestones along the timeline of a Git project. Commits are created with the git commit command to capture the state of a project at that point in time.
Create a minimal example and reverse engineer the format
Create a simple repository, and before any packfiles are created (git gc, git config gc.auto, git-prune-packed, ...), unpack a commit object with one of the methods from: How to DEFLATE with a command line tool to extract a git object?
export GIT_AUTHOR_DATE="1970-01-01T00:00:00+0000"
export GIT_AUTHOR_EMAIL="author@example.com"
export GIT_AUTHOR_NAME="Author Name"
export GIT_COMMITTER_DATE="2000-01-01T00:00:00+0000"
export GIT_COMMITTER_EMAIL="committer@example.com"
export GIT_COMMITTER_NAME="Committer Name"
git init
# First commit.
echo
touch a
git add a
git commit -m 'First message'
python3 -c "import zlib,sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" \
<.git/objects/45/3a2378ba0eb310df8741aa26d1c861ac4c512f | hd
# Second commit.
echo
touch b
git add b
git commit -m 'Second message'
python3 -c "import zlib,sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" \
<.git/objects/74/8e6f7e22cac87acec8c26ee690b4ff0388cbf5 | hd
The output is:
Initialized empty Git repository in /home/ciro/test/git/.git/
[master (root-commit) 453a237] First message
Author: Author Name <author@example.com>
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 a
00000000 63 6f 6d 6d 69 74 20 31 37 34 00 74 72 65 65 20 |commit 174.tree |
00000010 34 39 36 64 36 34 32 38 62 39 63 66 39 32 39 38 |496d6428b9cf9298|
00000020 31 64 63 39 34 39 35 32 31 31 65 36 65 31 31 32 |1dc9495211e6e112|
00000030 30 66 62 36 66 32 62 61 0a 61 75 74 68 6f 72 20 |0fb6f2ba.author |
00000040 41 75 74 68 6f 72 20 4e 61 6d 65 20 3c 61 75 74 |Author Name <aut|
00000050 68 6f 72 40 65 78 61 6d 70 6c 65 2e 63 6f 6d 3e |hor@example.com>|
00000060 20 30 20 2b 30 30 30 30 0a 63 6f 6d 6d 69 74 74 | 0 +0000.committ|
00000070 65 72 20 43 6f 6d 6d 69 74 74 65 72 20 4e 61 6d |er Committer Nam|
00000080 65 20 3c 63 6f 6d 6d 69 74 74 65 72 40 65 78 61 |e <committer@exa|
00000090 6d 70 6c 65 2e 63 6f 6d 3e 20 39 34 36 36 38 34 |mple.com> 946684|
000000a0 38 30 30 20 2b 30 30 30 30 0a 0a 46 69 72 73 74 |800 +0000..First|
000000b0 20 6d 65 73 73 61 67 65 0a | message.|
000000ba
[master 748e6f7] Second message
Author: Author Name <author@example.com>
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 b
00000000 63 6f 6d 6d 69 74 20 32 32 33 00 74 72 65 65 20 |commit 223.tree |
00000010 32 39 36 65 35 36 30 32 33 63 64 63 30 33 34 64 |296e56023cdc034d|
00000020 32 37 33 35 66 65 65 38 63 30 64 38 35 61 36 35 |2735fee8c0d85a65|
00000030 39 64 31 62 30 37 66 34 0a 70 61 72 65 6e 74 20 |9d1b07f4.parent |
00000040 34 35 33 61 32 33 37 38 62 61 30 65 62 33 31 30 |453a2378ba0eb310|
00000050 64 66 38 37 34 31 61 61 32 36 64 31 63 38 36 31 |df8741aa26d1c861|
00000060 61 63 34 63 35 31 32 66 0a 61 75 74 68 6f 72 20 |ac4c512f.author |
00000070 41 75 74 68 6f 72 20 4e 61 6d 65 20 3c 61 75 74 |Author Name <aut|
00000080 68 6f 72 40 65 78 61 6d 70 6c 65 2e 63 6f 6d 3e |hor@example.com>|
00000090 20 30 20 2b 30 30 30 30 0a 63 6f 6d 6d 69 74 74 | 0 +0000.committ|
000000a0 65 72 20 43 6f 6d 6d 69 74 74 65 72 20 4e 61 6d |er Committer Nam|
000000b0 65 20 3c 63 6f 6d 6d 69 74 74 65 72 40 65 78 61 |e <committer@exa|
000000c0 6d 70 6c 65 2e 63 6f 6d 3e 20 39 34 36 36 38 34 |mple.com> 946684|
000000d0 38 30 30 20 2b 30 30 30 30 0a 0a 53 65 63 6f 6e |800 +0000..Secon|
000000e0 64 20 6d 65 73 73 61 67 65 0a |d message.|
000000eb
Then we deduce that the format is as follows:
Top level:
commit {size}\0{content}
where {size} is the number of bytes in {content}.
This follows the same pattern for all object types.
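The top-level framing can be checked with a few lines of Python. This is only a sketch: split_object is a hypothetical helper name, and it expects the already decompressed bytes of a loose object.

```python
from typing import Tuple


def split_object(raw: bytes) -> Tuple[str, bytes]:
    """Split decompressed object bytes into (type, content).

    Raises ValueError if the NUL separator is missing or the declared
    size does not match the actual content length.
    """
    header, sep, content = raw.partition(b"\x00")
    if not sep:
        raise ValueError("no NUL byte after the header")
    obj_type, size_text = header.decode("ascii").split(" ")
    if int(size_text) != len(content):
        raise ValueError("declared size does not match content length")
    return obj_type, content


print(split_object(b"blob 6\x00hello\n"))
```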
{content}:
tree {tree_sha}
{parents}
author {author_name} <{author_email}> {author_date_seconds} {author_date_timezone}
committer {committer_name} <{committer_email}> {committer_date_seconds} {committer_date_timezone}

{commit message}
where:
{tree_sha}: SHA of the tree object this commit points to. This represents the top-level Git repo directory. That SHA comes from the format of the tree object: What is the internal format of a git tree object?
{parents}: optional list of parent commit objects of the form:
parent {parent1_sha}
parent {parent2_sha}
...
The list can be empty if there are no parents, e.g. for the first commit in a repo.
Two parents happen in regular merge commits.
More than two parents are possible with an octopus merge (git merge -s octopus), but this is not a common workflow. Here is an example: https://github.com/cirosantilli/test-octopus-100k
{author_name}: e.g. Ciro Santilli. Cannot contain <, \n
{author_email}: e.g. author@example.com. Cannot contain >, \n
{author_date_seconds}: seconds since 1970-01-01 00:00:00 UTC, e.g. 946684800 is the first second of the year 2000 in UTC
{author_date_timezone}: e.g. +0000 is UTC
committer fields: analogous to author fields
{commit message}: arbitrary.
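The deduced layout can be turned into a small parser. The following is a sketch under the assumptions above, not Git's own parser: parse_commit and parse_person are hypothetical names, header lines the format above does not mention (for example signatures) are simply skipped, and SAMPLE is the content of the second commit from the dump, typed out by hand.

```python
import re
from datetime import datetime, timedelta, timezone

PERSON_RE = re.compile(
    rb"^(?P<name>[^<\n]*) <(?P<email>[^>\n]*)> (?P<seconds>\d+) (?P<tz>[+-]\d{4})$"
)


def parse_person(value: bytes) -> dict:
    """Parse '{name} <{email}> {seconds} {timezone}'."""
    match = PERSON_RE.match(value)
    if match is None:
        raise ValueError("unexpected author/committer line")
    tz = match.group("tz").decode("ascii")
    sign = -1 if tz[0] == "-" else 1
    offset = sign * timedelta(hours=int(tz[1:3]), minutes=int(tz[3:5]))
    when = datetime.fromtimestamp(int(match.group("seconds")), tz=timezone(offset))
    return {
        "name": match.group("name").decode("utf-8", "replace"),
        "email": match.group("email").decode("utf-8", "replace"),
        "date": when,
    }


def parse_commit(content: bytes) -> dict:
    """Parse commit {content} (the bytes after 'commit {size}\\0')."""
    headers, _, message = content.partition(b"\n\n")
    commit = {"tree": None, "parents": [], "message": message.decode("utf-8", "replace")}
    for line in headers.split(b"\n"):
        key, _, value = line.partition(b" ")
        if key == b"tree":
            commit["tree"] = value.decode("ascii")
        elif key == b"parent":
            commit["parents"].append(value.decode("ascii"))
        elif key in (b"author", b"committer"):
            commit[key.decode("ascii")] = parse_person(value)
        # Any other header line is ignored in this sketch.
    return commit


SAMPLE = (
    b"tree 296e56023cdc034d2735fee8c0d85a659d1b07f4\n"
    b"parent 453a2378ba0eb310df8741aa26d1c861ac4c512f\n"
    b"author Author Name <author@example.com> 0 +0000\n"
    b"committer Committer Name <committer@example.com> 946684800 +0000\n"
    b"\n"
    b"Second message\n"
)

commit = parse_commit(SAMPLE)
print(len(SAMPLE), commit["parents"], commit["committer"]["date"].isoformat())
```

The length of SAMPLE is 223 bytes, which agrees with the "commit 223" header in the dump.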
I've made a minimal Python script that generates a git repo with a few commits at: https://github.com/cirosantilli/test-git-web-interface/blob/864d809c36b8f3b232d5b0668917060e8bcba3e8/other-test-repos/util.py#L83
I've used that for fun things like:
Here is an analogous analysis of the tag object format: What is the format of a git tag object and how to calculate its SHA?
Before you head down this path much further, I might recommend that you read through the section in the Git Manual about its internals. I find that knowing the contents of this chapter is usually the difference between liking Git and hating it. Understanding why Git is doing things the way it does often makes all of the sort of weird commands it has for things make more sense.
To answer your question, the gibberish that you are seeing is the data for the object after it has been compressed using zlib. If you look under the heading "Object Storage" in the link above you can see some details about how this works. This is the short version of how files are stored in git:
So that answers your second question: a folder will contain all of the compressed objects whose hashes begin with the same two characters, regardless of their contents. A folder such as 9b can therefore hold more than one object.
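The mapping from a hash to its file can be sketched in Python as follows. The helper name loose_object_path is invented for this example, and it only applies to loose objects; objects that have been moved into a packfile no longer have a file of their own.

```python
from pathlib import Path


def loose_object_path(git_dir: str, object_id: str) -> Path:
    """Return where a loose object with this hash would be stored.

    The first two hex characters name the folder, the rest name the file.
    """
    return Path(git_dir) / "objects" / object_id[:2] / object_id[2:]


print(loose_object_path(".git", "453a2378ba0eb310df8741aa26d1c861ac4c512f"))
```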
If you want to see the contents of a blob, all you have to do is decompress it. If you just want to view the contents of the file, this can be done easily enough in most programming languages. I would warn you against trying to modify data, however. Modifying even a single byte in a file will change its hash. All of the metadata in git (namely, directory structures and commits) is stored using references to hashes, so modifying a single file means that you must also update all objects downstream from that file that reference that file's hash. Then you have to update all the objects that reference those hashes. And on, and on, and on... Trying to achieve this becomes very, very complicated very quickly. You'll save yourself a lot of time and heartache by just learning git's built-in commands.
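For read-only inspection, decompressing a loose object might look like the sketch below. The function name read_loose_object is an assumption made for this example, and it will not work on objects stored inside packfiles.

```python
import zlib
from pathlib import Path


def read_loose_object(path) -> bytes:
    """Return the decompressed bytes of a loose object file.

    The result starts with the '<type> <size>' header and a NUL byte,
    followed by the object content.
    """
    return zlib.decompress(Path(path).read_bytes())
```

Calling it with a path such as .git/objects/45/3a2378ba0eb310df8741aa26d1c861ac4c512f returns the same bytes that the hex dumps above display. It only reads the file, so the repository is left untouched.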