
How does git status work internally?

Tags:

git

From the git object model, files and folders are saved to locations in the .git folder by their sha1 hash.

How does git internally know if a file has been deleted, added, or edited (specifically, how does it compute the changes you see when you type git status)? Does the system determine this information purely from the sha1?

asked Apr 28 '16 17:04 by rookie



1 Answer

CodeWizard's answer is wrong in a few important details, as Edward Thomson noted in a comment.

The super-short version is that git status runs git diff.

In fact, it runs it twice, or more precisely, it runs two different internal variations on git diff: one to compare HEAD to the index/staging-area, and one to compare the staging-area to the work-tree. It runs each diff with a request to search for renames, i.e., sets the -M flag (see below). Finally, it presents the results from these diffs to you in whichever format you requested. In no case does it show the actual changes between files, though (so in effect it runs these diffs with --name-status as well).
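As a rough illustration of those two comparisons, here is a minimal Python sketch. It assumes each of HEAD, the index, and the work-tree can be modelled as a dict from path to object ID; the names name_status and status are made up for this example, and real Git relies on cached stat data in the index rather than hashing every work-tree file.

```python
def name_status(side_a, side_b):
    """Compare two {path: object_id} snapshots, in the spirit of --name-status."""
    changes = {}
    for path in sorted(set(side_a) | set(side_b)):
        if path not in side_b:
            changes[path] = "D"      # only on the A side: deleted
        elif path not in side_a:
            changes[path] = "A"      # only on the B side: added
        elif side_a[path] != side_b[path]:
            changes[path] = "M"      # same path, different content ID
    return changes


def status(head, index, work_tree):
    """Run the two comparisons described above (rename detection left out)."""
    staged = name_status(head, index)
    # Work-tree paths that are not in the index are reported separately,
    # so the second comparison only looks at paths the index knows about.
    tracked = {p: oid for p, oid in work_tree.items() if p in index}
    unstaged = name_status(index, tracked)
    return staged, unstaged


head = {"README": "a1", "src/main.c": "b1"}
index = {"README": "a1", "src/main.c": "b2", "src/new.c": "c1"}
work_tree = {"README": "a2", "src/main.c": "b2", "notes.txt": "d1"}
staged, unstaged = status(head, index, work_tree)
print("staged:", staged)
print("not staged:", unstaged)
```

The first dict corresponds to the "changes to be committed" part of the status output and the second to the "changes not staged for commit" part.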

Using the various diffs

You can run both of these internal diffs manually: one has a front-end command spelled git diff-index --cached, and one has a front-end command spelled git diff-files. This front-end selection is captured in the slightly oddly placed section titled Raw output format (I have had to modify this a bit to display better on StackOverflow):

The raw output format from git-diff-index, git-diff-tree, git-diff-files and git diff --raw are very similar.

These commands all compare two sets of things; what is compared differs:

git-diff-index tree-ish
      compares the tree-ish and the files on the filesystem.

git-diff-index --cached tree-ish
      compares the tree-ish and the index.

git-diff-tree [-r] tree-ish-1 tree-ish-2 [pattern ...]
      compares the trees named by the two arguments.

git-diff-files [pattern ...]
      compares the index and the files on the filesystem.

(You can invoke these with regular git diff as well: git diff --cached compares the current (HEAD) commit to the staging-area, and git diff with no additional arguments compares the staging-area to the work-tree.)
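If you would rather drive these from a script, something like the following Python sketch should work. It only assembles the two command lines named above, with the -M and --name-status options mentioned earlier; the constant names and the run_diff helper are my own, and it assumes git is on your PATH and that repo_dir is inside a repository with at least one commit.

```python
import subprocess

# HEAD versus the index, and the index versus the work-tree.
STAGED_CMD = ["git", "diff-index", "--cached", "-M", "--name-status", "HEAD"]
UNSTAGED_CMD = ["git", "diff-files", "-M", "--name-status"]


def run_diff(cmd, repo_dir="."):
    """Run one of the commands above in repo_dir and return its output lines."""
    result = subprocess.run(
        cmd, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

Calling run_diff(STAGED_CMD) and run_diff(UNSTAGED_CMD) in turn gives you the raw material for the two halves of the status output.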

Mapping trees back to paths

CodeWizard's answer has the keys to this process. Essentially, a tree object contains the path-name component (such as the foo or bar in foo/bar) and another object ID. If the component represents a directory, the object ID locates another tree object; if it represents a file, the object ID locates a blob object. In either case the ID is Git's internal name, which enables Git to find it in the repository.

(This is not true for the index/staging-area itself, whose format is not very well documented. It is a flat list of all files, with full path names but using a name compression technique as well, so that VeryLongDirectory/AnotherLongDirectory/bar followed by VeryLongDirectory/AnotherLongDirectory/baz does not have to spell out VeryLongDirectory/AnotherLongDirectory each time.)
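To make the contrast concrete, here is a small Python sketch that walks a nested, tree-like structure into a flat list of full paths, then applies a simple shared-prefix encoding. The nested dicts, the function names, and the (shared length, suffix) encoding are assumptions for illustration only; this does not reproduce Git's on-disk tree or index formats.

```python
def flatten(tree, prefix=""):
    """Walk nested {name: subtree-or-blob-id} dicts into (path, id) pairs."""
    entries = []
    for name, value in sorted(tree.items()):
        path = prefix + name
        if isinstance(value, dict):      # directory: the entry names another tree
            entries.extend(flatten(value, path + "/"))
        else:                            # file: the entry names a blob
            entries.append((path, value))
    return entries


def prefix_compress(paths):
    """Encode each path as (characters shared with the previous path, suffix)."""
    encoded, previous = [], ""
    for path in paths:
        shared = 0
        limit = min(len(path), len(previous))
        while shared < limit and path[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, path[shared:]))
        previous = path
    return encoded


tree = {"VeryLongDirectory": {"AnotherLongDirectory": {"bar": "id1", "baz": "id2"}}}
flat = flatten(tree)
encoded = prefix_compress([path for path, _ in flat])
print(flat)
print(encoded)
```

The second entry only has to store the final "z", because everything before it repeats the previous path.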

(Tree objects also store the mode that Git should assign to the file, upon extraction, except that in the tree object, each file mode is only ever 100644 or 100755; the final rwx bits are set based on your umask, assuming a Unix-like host, with x being always-clear if the stored mode is 100644, otherwise set-except-as-cleared-by-umask.)
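The umask rule can be written down in a few lines of Python. This is only a model of the description above, not Git's checkout code, and checkout_permissions is a made-up name.

```python
def checkout_permissions(tree_mode, umask):
    """Permission bits a checked-out file ends up with under this model."""
    if tree_mode == 0o100755:
        requested = 0o777    # executable: x bits set unless the umask clears them
    elif tree_mode == 0o100644:
        requested = 0o666    # non-executable: x bits always clear
    else:
        raise ValueError("expected a file mode of 100644 or 100755")
    return requested & ~umask


print(oct(checkout_permissions(0o100644, 0o022)))
print(oct(checkout_permissions(0o100755, 0o022)))
print(oct(checkout_permissions(0o100755, 0o077)))
```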

Unstaged files and detecting renames

How does git internally know if a file has been deleted, added, or edited (specifically, how does it compute the changes you see when you type git status)?

A file that is in the work-tree, but is neither in the HEAD commit nor in the index/staging-area, is untracked (this is in fact the definition of "untracked"). Git finds such files by looking at all three (and using the index/staging-area for cache information to speed up the process). All the untracked file paths are generally fed to the "ignore" code, which makes Git shut up about them if they are listed in .gitignore or any of the other ignore-some-paths files.
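A minimal Python sketch of that step, for paths that exist in the work-tree but that Git has no entry for (the ones git status lists as untracked). Two simplifications are mine: the sketch keys off the index alone, and it uses fnmatch as a stand-in for the real ignore rules, which also handle negation, anchoring, and directory-only patterns. The name untracked_paths is hypothetical.

```python
from fnmatch import fnmatch


def untracked_paths(work_tree, index, ignore_patterns=()):
    """Paths in the work-tree but not in the index, minus the ignored ones."""
    result = []
    for path in sorted(set(work_tree) - set(index)):
        basename = path.rsplit("/", 1)[-1]
        # Simplification: try each pattern against the full path and the
        # file name; real .gitignore matching is richer than this.
        if any(fnmatch(path, pat) or fnmatch(basename, pat)
               for pat in ignore_patterns):
            continue
        result.append(path)
    return result


work_tree = {"a.c", "a.o", "notes.txt", "build/x.o"}
index = {"a.c"}
print(untracked_paths(work_tree, index, ["*.o"]))
```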

Having dispensed with untracked paths, let's consider the remaining paths, which (by definition) appear in at least one of HEAD or the index/staging-area.

In general (there are more flags for controlling this in finer detail, although git status does not set any of them), Git first compares the path names available in the "A" side (a/foo/bar) to those in the "B" side (b/foo/bar). If the same path appears on both sides, chances are that the file was simply modified in place, and Git starts with that assumption. If a path appears in A but not in B, and some other path appears in B but not in A, the two paths are paired up and given to the rename detector (if it is enabled).

All the internal diffs share a bunch of code, and also share the documentation. Open the documentation for any of the commands above and search for -M or --find-renames:

-M[n] --find-renames[=n]

Detect renames. If n is specified, it is a threshold on the similarity index (i.e. amount of addition/deletions compared to the file’s size). For example, -M90% means Git should consider a delete/add pair to be a rename if more than 90% of the file hasn’t changed. Without a % sign, the number is to be read as a fraction, with a decimal point before it. I.e., -M5 becomes 0.5, and is thus the same as -M50%. Similarly, -M05 is the same as -M5%. To limit detection to exact renames, use -M100%. The default similarity index is 50%.
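The number-parsing rule in that description is easy to misread, so here it is as a tiny Python sketch. It covers only the forms mentioned in the quoted text (a percentage, or bare digits read as a fraction); similarity_threshold is a made-up helper, not anything Git exposes.

```python
def similarity_threshold(arg=None):
    """Turn the n in -M[n] into a fraction between 0 and 1."""
    if arg is None or arg == "":
        return 0.5                        # documented default: 50%
    if arg.endswith("%"):
        return float(arg[:-1]) / 100      # "90%" -> 0.9
    return float("0." + arg)              # "5" -> 0.5, "05" -> 0.05


for n in ("90%", "5", "50%", "05", "5%", "100%"):
    print(n, similarity_threshold(n))
```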

The rename detector can be enabled by default by setting diff.renames to true in your configuration. When this answer was first written it was disabled by default; it has been enabled by default since Git 2.9.

For further detail on the similarity matching, see this answer from Edward Thomson.

Once the rename detector decides that some A-to-B change is a rename, it pulls both names out of the "only in A" and "only in B" lists.
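Here is a rough Python sketch of that pairing step. Python's difflib.SequenceMatcher stands in for Git's own similarity computation, which works differently, so the scores will not match Git's; the function name and the dicts of path to file content are assumptions for illustration.

```python
from difflib import SequenceMatcher


def detect_renames(only_in_a, only_in_b, threshold=0.5):
    """Pair "only in A" and "only in B" paths whose contents are similar enough.

    Both arguments map path -> file content (str). Matched paths are removed
    from both dicts, so whatever is left is a plain delete or a plain add.
    """
    renames = []
    for old_path in sorted(only_in_a):
        best_path, best_score = None, 0.0
        for new_path in sorted(only_in_b):
            matcher = SequenceMatcher(None, only_in_a[old_path], only_in_b[new_path])
            score = matcher.ratio()
            if score > best_score:
                best_path, best_score = new_path, score
        if best_path is not None and best_score >= threshold:
            renames.append((old_path, best_path, best_score))
            del only_in_a[old_path]
            del only_in_b[best_path]
    return renames


only_in_a = {"old.txt": "line one\nline two\nline three\n", "gone.txt": "zzzz"}
only_in_b = {"new.txt": "line one\nline two\nline 3\n", "fresh.txt": "qqqqqq"}
renames = detect_renames(only_in_a, only_in_b)
print([(old, new) for old, new, _ in renames])
print(sorted(only_in_a), sorted(only_in_b))
```

After the call, gone.txt is still in the A-only dict and fresh.txt is still in the B-only dict, because neither found a similar enough partner.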

Added and deleted

After running the rename detector (if enabled), any files that are found only on the A side are "deleted", and any found only on the B side are "added". For git status, this concludes the whole process (except for displaying the results). For regular git diff we usually go on to produce actual diff output, when some file is modified or renamed-and-modified.

(Note that all of Git's diffs share all of this machinery, so they will all find the same set of renames, provided you turn on rename detection and set the same thresholds. These are also used during git merge.)

Side note: renames are detected, not recorded

Many other version control systems (Mercurial, ClearCase, Perforce) require that you register a file-rename with them: hg mv and so on. This is because they record the rename with each commit. A system that does this necessarily gives each file some kind of identifier (this could be a true object ID as in ClearCase, or simply "its name in the current commit", which is then munged as needed as we move from commit to commit). The advantage to this system is that the VCS can follow the file no matter how changed it gets. A disadvantage is that you must record the change, and a file that is accidentally deleted, then resurrected, can get a new ID (see ClearCase "evil twins").

Git simply re-discovers the rename, every time it goes to compare one commit to another (or a commit to the index, or the index to the work-tree, etc). This means you do not have to use git mv: you can git rm --cached the old path and git add the new one, to get the same effect. (You can, of course, use git mv whenever it is more convenient, which is most of the time. But this is a significant difference from version control systems that record, with each check-in or commit, directory modifications: with these systems you must invoke the VCS-specific mv command, such as hg mv or cleartool mv, to inform the VCS that the file moved, rather than letting the VCS figure it out later.)
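The simplest case of this re-discovery is an exact rename: the content is unchanged, so the blob ID is identical and only the path differs. The Python sketch below computes blob IDs the way SHA-1 repositories do (SHA-1 over a "blob <size>" header, a NUL byte, and the content) and pairs up vanished and new paths that share an ID. The exact_renames function and the dicts of path to ID are illustrative, not Git's implementation.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA-1 over a 'blob <size>' header, a NUL byte, then the file content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def exact_renames(commit_a, commit_b):
    """Paths that vanished from A and reappeared in B with an identical blob ID."""
    gone = {p: oid for p, oid in commit_a.items() if p not in commit_b}
    new_by_id = {}
    for path, oid in sorted(commit_b.items()):
        if path not in commit_a:
            new_by_id.setdefault(oid, path)
    renames = []
    for old_path, oid in sorted(gone.items()):
        if oid in new_by_id:
            # pop() so that one new path is never paired with two old paths
            renames.append((old_path, new_by_id.pop(oid)))
    return renames


text = b"the same content\n"
commit_a = {"docs/old-name.txt": blob_id(text), "README": blob_id(b"readme\n")}
commit_b = {"docs/new-name.txt": blob_id(text), "README": blob_id(b"readme\n")}
print(exact_renames(commit_a, commit_b))
```

Nothing in either commit says "this file was renamed"; the pairing falls out of comparing the two snapshots.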

answered Oct 02 '22 04:10 by torek