
What is the mathematical structure that represents a Git repo?

Tags: git, graph

I am learning about Git, and it would be great if I had a description of the mathematical structure that represents a Git repo. For instance: it's a directed acyclic graph; its nodes represent commits; its nodes have labels (at most one label per node, no label used twice) that represent branches, etc. (I know this description is not correct, I'm just trying to explain what I'm looking for.)

max asked Sep 03 '13 08:09



1 Answer

In addition to the links in Nevik Rehnel's comment (copied here per request: eagain.net/articles/git-for-computer-scientists and gitolite.com/gcs), and sehe's point that the commit graph forms a Merkle Tree, I'll add a few notes.

  • There are four object types in the object-store: commit, tree, annotated-tag, and blob (file).
  • A commit object contains exactly one tree-ref (which of course can point to more trees), a possibly-empty list of parent SHA-1 hashes (each of which must name another commit), an author (name, email, and timestamp), a committer (same form as author), and the commit message.
  • A tree object contains zero or more (mode, sub-object, filename) entries. If the sub-object is another tree, the filename names a directory. If it's a blob, it names a file. The mode looks like a POSIX file mode, and if it's 120000 (the file mode for a symlink), the blob's "contents" are really the symlink target. Mode 160000 (a "gitlink") is (ab)used for submodules: there the sub-object is a commit ID in the submodule's repository rather than an object in this one. R and W mode bits are not stored, only the X bit (and even that is ignored if the repo configuration sets core.fileMode to false).
  • An annotated-tag object contains an object reference, a tagger (name, email, and timestamp), and the tag text. The referenced object is normally a commit but a tag object can point to any object (even another tag object).
  • The labels (branches and tags and reflog-references and so on) live outside the object-store. For annotated tags, there's a label outside, pointing to the annotated tag object inside the object-store. For a lightweight tag, the outside label points right to a commit.
  • There is no restriction that there be only one root commit. Any commit with no parents is a root.
  • Git almost never makes an empty tree (which would represent an empty directory), with two exceptions: git behaves as if the empty tree exists in every repo at all times, and if you make an initial empty commit (with git commit --allow-empty) it uses that empty tree. (Since the empty tree has no sub-objects, its SHA-1 hash value is a constant: 4b825dc642cb6eb9a060e54bf8d69288fbee4904.)
  • The "DAG" description is generally meant for the graph formed by closing over commit parent hashes. However, a tree object should in general not contain itself in any of its subtrees, and if you managed to make a cyclic tree structure you would not be able to check it out (because it recurses infinitely). Assuming you cannot make two different trees with the same checksum (if you could you'd break git), you can't construct such a cycle: a tree's checksum is computed from the checksums of its sub-objects, so a tree T1 containing a tree T2 that in turn contains T1 would need T1's checksum before T1 exists. So the trees are implicitly a DAG too, and being attached to commit-DAGs, they form a bigger DAG. :-)
  • Unreferenced objects in the object-store will get garbage-collected by git gc. The empty tree appears to be immune to collection. Anything in the refs/ and logs/ directories and the file packed-refs (in .git, or for bare repos or when $GIT_DIR is set, wherever else) acts as a reference, as do the special names (HEAD, ORIG_HEAD, etc.); I'm not sure if other random files, if created in .git and containing valid SHA-1s, would act as references, or not.
  • The index has some format I've never dug into. It contains references to objects in the object store. When you git add a file, git drops the file into the object-store and places the (non-text) SHA-1 hash into the index file. These are valid references that prevent garbage collection.
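
To make the notes above a bit more concrete, here is a minimal Python sketch of the object naming, not git's actual implementation. It assumes the usual loose-object convention (the ID is the SHA-1 of "<type> <size>", a NUL byte, then the body). The helper names (object_id, tree_body, commit_body), the author string, and the file name are made up for illustration, and the tree entry ordering is simplified.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """Name an object the way git does: SHA-1 over '<kind> <size>\\0' + body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex_id) entries as '<mode> <name>\\0<20 raw bytes>'.

    Simplification: entries are sorted by plain name; real git orders
    directory names as if they ended in '/'.
    """
    return b"".join(
        f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
        for mode, name, hex_id in sorted(entries, key=lambda e: e[1])
    )


def commit_body(tree_id, parent_ids, author, committer, message) -> bytes:
    """One tree line, zero or more parent lines, author, committer, message."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return "\n".join(lines).encode()


# The empty tree has no entries, so its ID is the same constant everywhere.
EMPTY_TREE = object_id("tree", b"")  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# Hypothetical one-file project: blob -> tree -> root commit -> child commit.
blob = object_id("blob", b"hello\n")
tree = object_id("tree", tree_body([("100644", "README", blob)]))
who = "A U Thor <author@example.com> 0 +0000"  # made-up identity
root = object_id("commit", commit_body(tree, [], who, who, "initial\n"))
child = object_id("commit", commit_body(tree, [root], who, who, "second\n"))

# The commit graph: each commit ID maps to its list of parent IDs.
parents = {root: [], child: [root]}
roots = [c for c, ps in parents.items() if not ps]

# Labels live outside the object store and simply map names to object IDs.
refs = {"refs/heads/master": child, "HEAD": "ref: refs/heads/master"}
```

Because each ID is derived from the IDs of everything the object refers to, changing one byte of the blob changes the tree ID and every commit ID above it. That is the Merkle property mentioned at the top, and it is also why an object cannot (short of a hash collision) refer back to itself.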
torek answered Nov 08 '22 08:11