
What is the file format of a git commit object data structure?

Tags:

git

Context: I was hoping to be able to search through my git commit messages and commits without having to go through the puzzlingly complex git grep command, so I decided to see how git commit messages were stored.

I took a look in a .git folder, and it looks to me like commits are stored in

.git/objects 

The .git objects folder contains a bunch of folders with names like a6 and 9b. These folders each contain a file with a name that looks like a commit sha 2f29598814b07fea915514cfc4d05129967bf7. When I open one of those files in a text editor, I get gibberish.

  1. What file format is this gibberish / How is a git commit object stored?
  2. In this git commit log, the folder 9b contains one commit sha

    aed8a9f773efb2f498f19c31f8603b6cb2a4bc
    

    Why, and is there a case where more than one commit sha would be stored in the folder 9b?

  3. Is there a way to convert this gibberish into plain text so that I can mess with commits in a text editor?

Tara Roys asked Apr 09 '14 16:04




2 Answers

Create a minimal example and reverse engineer the format

Create a simple repository, and before any packfiles are created (git gc, git config gc.auto, git-prune-packed ...), unpack a commit object with one of the methods from: How to DEFLATE with a command line tool to extract a git object?

# Fix the author and committer identity so the commit hashes are reproducible.
export GIT_AUTHOR_DATE="1970-01-01T00:00:00+0000"
export GIT_AUTHOR_EMAIL="author@example.com"
export GIT_AUTHOR_NAME="Author Name"
export GIT_COMMITTER_DATE="2000-01-01T00:00:00+0000"
export GIT_COMMITTER_EMAIL="committer@example.com"
export GIT_COMMITTER_NAME="Committer Name"

git init

# First commit.
echo
touch a
git add a
git commit -m 'First message'
# Decompress the loose commit object (Python 3) and hex-dump it.
python3 -c "import zlib,sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" \
  <.git/objects/45/3a2378ba0eb310df8741aa26d1c861ac4c512f | hd

# Second commit.
echo
touch b
git add b
git commit -m 'Second message'
# Decompress the loose commit object (Python 3) and hex-dump it.
python3 -c "import zlib,sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" \
  <.git/objects/74/8e6f7e22cac87acec8c26ee690b4ff0388cbf5 | hd

The output is:

Initialized empty Git repository in /home/ciro/test/git/.git/

[master (root-commit) 453a237] First message
 Author: Author Name <author@example.com>
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 a
00000000  63 6f 6d 6d 69 74 20 31  37 34 00 74 72 65 65 20  |commit 174.tree |
00000010  34 39 36 64 36 34 32 38  62 39 63 66 39 32 39 38  |496d6428b9cf9298|
00000020  31 64 63 39 34 39 35 32  31 31 65 36 65 31 31 32  |1dc9495211e6e112|
00000030  30 66 62 36 66 32 62 61  0a 61 75 74 68 6f 72 20  |0fb6f2ba.author |
00000040  41 75 74 68 6f 72 20 4e  61 6d 65 20 3c 61 75 74  |Author Name <aut|
00000050  68 6f 72 40 65 78 61 6d  70 6c 65 2e 63 6f 6d 3e  |hor@example.com>|
00000060  20 30 20 2b 30 30 30 30  0a 63 6f 6d 6d 69 74 74  | 0 +0000.committ|
00000070  65 72 20 43 6f 6d 6d 69  74 74 65 72 20 4e 61 6d  |er Committer Nam|
00000080  65 20 3c 63 6f 6d 6d 69  74 74 65 72 40 65 78 61  |e <committer@exa|
00000090  6d 70 6c 65 2e 63 6f 6d  3e 20 39 34 36 36 38 34  |mple.com> 946684|
000000a0  38 30 30 20 2b 30 30 30  30 0a 0a 46 69 72 73 74  |800 +0000..First|
000000b0  20 6d 65 73 73 61 67 65  0a                       | message.|
000000b9

[master 748e6f7] Second message
 Author: Author Name <author@example.com>
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 b
00000000  63 6f 6d 6d 69 74 20 32  32 33 00 74 72 65 65 20  |commit 223.tree |
00000010  32 39 36 65 35 36 30 32  33 63 64 63 30 33 34 64  |296e56023cdc034d|
00000020  32 37 33 35 66 65 65 38  63 30 64 38 35 61 36 35  |2735fee8c0d85a65|
00000030  39 64 31 62 30 37 66 34  0a 70 61 72 65 6e 74 20  |9d1b07f4.parent |
00000040  34 35 33 61 32 33 37 38  62 61 30 65 62 33 31 30  |453a2378ba0eb310|
00000050  64 66 38 37 34 31 61 61  32 36 64 31 63 38 36 31  |df8741aa26d1c861|
00000060  61 63 34 63 35 31 32 66  0a 61 75 74 68 6f 72 20  |ac4c512f.author |
00000070  41 75 74 68 6f 72 20 4e  61 6d 65 20 3c 61 75 74  |Author Name <aut|
00000080  68 6f 72 40 65 78 61 6d  70 6c 65 2e 63 6f 6d 3e  |hor@example.com>|
00000090  20 30 20 2b 30 30 30 30  0a 63 6f 6d 6d 69 74 74  | 0 +0000.committ|
000000a0  65 72 20 43 6f 6d 6d 69  74 74 65 72 20 4e 61 6d  |er Committer Nam|
000000b0  65 20 3c 63 6f 6d 6d 69  74 74 65 72 40 65 78 61  |e <committer@exa|
000000c0  6d 70 6c 65 2e 63 6f 6d  3e 20 39 34 36 36 38 34  |mple.com> 946684|
000000d0  38 30 30 20 2b 30 30 30  30 0a 0a 53 65 63 6f 6e  |800 +0000..Secon|
000000e0  64 20 6d 65 73 73 61 67  65 0a                    |d message.|
000000ea

Then we deduce that the format is as follows:

  • Top level:

    commit {size}\0{content}
    

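    where {size} is the number of bytes in {content}, written as ASCII decimal digits (174 and 223 in the dumps above).

    All object types follow the same pattern, with commit replaced by the type name (blob, tree or tag).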

  • {content}:

    tree {tree_sha}
    {parents}
    author {author_name} <{author_email}> {author_date_seconds} {author_date_timezone}
    committer {committer_name} <{committer_email}> {committer_date_seconds} {committer_date_timezone}
    
    {commit message}
    

    where:

    • {tree_sha}: SHA of the tree object this commit points to.

      This represents the top-level Git repo directory.

      That SHA comes from the format of the tree object: What is the internal format of a git tree object?

    • {parents}: optional list of parent commit objects of form:

      parent {parent1_sha}
      parent {parent2_sha}
      ...
      

      The list can be empty if there are no parents, e.g. for the first commit in a repo.

      Two parents happen in regular merge commits.

      More than two parents are possible with an octopus merge (git merge -s octopus, the strategy Git uses by default when you merge more than one branch at once), but this is not a common workflow. Here is an example: https://github.com/cirosantilli/test-octopus-100k

    • {author_name}: e.g.: Ciro Santilli. Cannot contain <, \n

    • {author_email}: e.g.: author@example.com. Cannot contain >, \n

    • {author_date_seconds}: seconds since 1970, e.g. 946684800 is the first second of year 2000

    • {author_date_timezone}: e.g.: +0000 is UTC

    • committer fields: analogous to author fields

    • {commit message}: arbitrary.
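
The format deduced above can be checked with a small parser. The following is only a sketch in Python 3: parse_commit and parse_person are hypothetical helper names, the input is the already-decompressed first commit rebuilt by hand from the hex dump, and extra headers that real commits may carry (for example gpgsig or encoding) are not treated specially.

```python
from datetime import datetime, timedelta, timezone

# First commit from the hex dump above, rebuilt by hand (already decompressed).
CONTENT = (
    b"tree 496d6428b9cf92981dc9495211e6e1120fb6f2ba\n"
    b"author Author Name <author@example.com> 0 +0000\n"
    b"committer Committer Name <committer@example.com> 946684800 +0000\n"
    b"\n"
    b"First message\n"
)
RAW = b"commit %d\x00" % len(CONTENT) + CONTENT


def parse_person(value):
    """Split 'Name <email> seconds +hhmm' into name, email and an aware datetime."""
    name, rest = value.split(" <", 1)
    email, rest = rest.split("> ", 1)
    seconds, tz = rest.split(" ")
    sign = -1 if tz[0] == "-" else 1
    offset = sign * timedelta(hours=int(tz[1:3]), minutes=int(tz[3:5]))
    when = datetime.fromtimestamp(int(seconds), timezone(offset))
    return {"name": name, "email": email, "date": when}


def parse_commit(raw):
    """Parse a decompressed commit object into a dict (hypothetical helper)."""
    header, content = raw.split(b"\x00", 1)
    obj_type, size = header.split(b" ")
    if obj_type != b"commit" or int(size) != len(content):
        raise ValueError("not a well-formed commit object")
    # The first blank line separates the header lines from the message.
    head, message = content.decode("utf-8").split("\n\n", 1)
    commit = {"parents": [], "message": message}
    for line in head.split("\n"):
        key, value = line.split(" ", 1)
        if key == "parent":
            commit["parents"].append(value)
        elif key in ("author", "committer"):
            commit[key] = parse_person(value)
        else:
            commit[key] = value
    return commit


commit = parse_commit(RAW)
print(len(CONTENT))                            # 174, the {size} in the header
print(commit["tree"])                          # 496d6428b9cf92981dc9495211e6e1120fb6f2ba
print(commit["parents"])                       # [] because this is the root commit
print(commit["committer"]["date"].isoformat()) # 2000-01-01T00:00:00+00:00
```

Splitting on the first blank line is what keeps the message "arbitrary": anything after it, including further blank lines, is returned untouched.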

I've made a minimal Python script that generates a git repo with a few commits at: https://github.com/cirosantilli/test-git-web-interface/blob/864d809c36b8f3b232d5b0668917060e8bcba3e8/other-test-repos/util.py#L83

I've used that for fun things like:

  • Who is the user with the longest streak on GitHub?
  • https://www.quora.com/Which-GitHub-repo-has-the-most-commits/answer/Ciro-Santilli
  • https://github.com/isaacs/github/issues/1344

Here is an analogous analysis of the tag object format: What is the format of a git tag object and how to calculate its SHA?


Before you head much further down this path, I recommend reading the section of the Git Manual about its internals. I find that knowing the contents of this chapter is usually the difference between liking Git and hating it. Once you understand why Git does things the way it does, many of its odd-looking commands start to make sense.

To answer your question, the gibberish that you are seeing is the data for the object after it has been compressed using zlib. If you look under the heading "Object Storage" in the link above you can see some details about how this works. This is the short version of how files are stored in git:

  1. Create a git specific header for the content.
  2. Generate a hash of the concatenation of the header + content.
  3. Compress the concatenation of the header + content.
  4. Store the compressed data to disk in a folder with a name equal to the first two characters of the data's hash and a file name with the remaining 38 characters.

That answers your second question: a folder contains all of the compressed objects whose hashes begin with the same two characters, regardless of their contents.
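
The four storage steps can be sketched in a few lines of Python 3. This is an illustration rather than Git's own code: make_loose_object is a hypothetical helper name, it assumes a SHA-1 repository, and the compressed bytes may differ from what Git writes because the zlib compression level can differ (the decompressed data is the same).

```python
import hashlib
import zlib


def make_loose_object(obj_type, content):
    """Return (sha, relative_path, compressed_bytes) for one Git object."""
    header = b"%s %d\x00" % (obj_type, len(content))  # 1. git-specific header
    store = header + content
    sha = hashlib.sha1(store).hexdigest()             # 2. hash header + content
    compressed = zlib.compress(store)                 # 3. compress header + content
    path = "objects/%s/%s" % (sha[:2], sha[2:])       # 4. 2-character folder, 38-character file
    return sha, path, compressed


# An empty blob is a convenient check because its hash is well known.
sha, path, data = make_loose_object(b"blob", b"")
print(sha)   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(path)  # objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Any other object whose hash happens to start with e6 would land in the same folder, whatever its type or contents.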

If you want to see the contents of an object, all you have to do is decompress it. If you just want to view the contents, this is easy enough in most programming languages. I would warn you against trying to modify the data, however. Modifying even a single byte in a file will change its hash. All of the metadata in git (namely, directory structures and commits) is stored using references to hashes, so modifying a single file means that you must also update every object that references that file's hash. Then you have to update all the objects that reference those hashes. And on, and on, and on... Trying to achieve this becomes very complicated very quickly. You'll save yourself a lot of time and heartache by just learning git's built-in commands.
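
As a rough illustration of both points, here is a Python 3 sketch that reads a loose object back and then shows how a one-byte edit changes the hash. read_loose_object is a hypothetical helper, the demo runs against a throwaway directory laid out like .git/objects rather than a real repository, and packed objects are not handled (in a real repository, git cat-file -p <sha> covers both cases).

```python
import hashlib
import os
import tempfile
import zlib


def read_loose_object(git_dir, sha):
    """Return (type, content) for a loose object; packfiles are not handled."""
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    header, content = raw.split(b"\x00", 1)
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), content


# Build a throwaway object store containing a single blob.
git_dir = tempfile.mkdtemp()
store = b"blob 6\x00hello\n"
sha = hashlib.sha1(store).hexdigest()
os.makedirs(os.path.join(git_dir, "objects", sha[:2]))
with open(os.path.join(git_dir, "objects", sha[:2], sha[2:]), "wb") as f:
    f.write(zlib.compress(store))

obj_type, content = read_loose_object(git_dir, sha)
print(obj_type, content)

# Changing a single byte of the content gives an unrelated object name,
# so every tree and commit that referenced the old name would need rewriting.
edited_sha = hashlib.sha1(b"blob 6\x00jello\n").hexdigest()
print(sha)
print(edited_sha)
```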

TwentyMiles answered Sep 29 '22 07:09