What algorithm does git use to detect changes on your working tree?

Tags: git

This is about the internals of git.

I've been reading the great 'Pro Git' book and learning a little about how git works internally (SHA1 hashes, blobs, references, trees, commits, etc.). Pretty clever architecture, by the way.

For context: git identifies the content of a file by its SHA1 value, so it can tell whether that content has changed just by comparing hashes. My question is specifically about how git checks whether the content in the working tree has changed.

The naive approach would be to assume that each time you run git status or a similar command, git walks through every file in the working directory, calculates its SHA1, and compares it with the one recorded in the last commit. But that seems very inefficient for big projects such as the Linux kernel.
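To make the naive approach concrete, here is a minimal sketch of it in Python. It relies on the fact that git hashes a blob as SHA1 over the header "blob <size>\0" followed by the file content. The function names and the `recorded` mapping (relative path to expected hash) are hypothetical, made up for illustration; this is not how git itself is implemented.

```python
import hashlib
import os


def git_blob_sha1(data: bytes) -> str:
    """Return the SHA1 git assigns to a blob with this content."""
    # Git prefixes the content with "blob <size in bytes>\0" before hashing.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def naive_changed_files(root: str, recorded: dict) -> list:
    """Re-hash every recorded file and list those whose hash differs.

    `recorded` is a hypothetical mapping of relative path -> expected SHA1.
    Every file is read in full on every call, which is the cost the
    question is worried about.
    """
    changed = []
    for rel_path, expected in recorded.items():
        full_path = os.path.join(root, rel_path)
        try:
            with open(full_path, "rb") as f:
                actual = git_blob_sha1(f.read())
        except FileNotFoundError:
            actual = None  # a deleted file counts as changed
        if actual != expected:
            changed.append(rel_path)
    return changed
```

The cost grows with the total size of the tracked files, since each one has to be read and hashed on every check.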

Another idea would be to check each file's last modification date, but I don't think git stores that information (when you clone a repository, all the files get a new timestamp).

I'm sure it does this in an efficient way (git is really fast). Does anyone know how that is achieved?

PS: Just to add an interesting link about the git index, which specifically states that the index keeps information about file timestamps, even though the tree objects do not.

Khelben asked Nov 02 '10 06:11




1 Answer

Git’s index maintains timestamps of when git last wrote each file into the working tree (and updates these whenever files are cached from the working tree or from a commit). You can see the metadata with git ls-files --debug. In addition to the timestamp, it records the size, inode, and other information from lstat to reduce the chance of a false positive.

When you run git status, it simply calls lstat on every file in the working tree and compares the result with the metadata cached in the index in order to quickly determine which files are unchanged. This is described in the documentation under racy-git and update-index.
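As a rough illustration of the idea, the sketch below caches a few lstat fields per file and treats a file as unchanged when they all still match. The names `CachedStat`, `snapshot` and `maybe_changed` are hypothetical, and the field selection is a simplification: the real index stores more fields and also handles the "racy" case, where a file is modified within the same timestamp granularity as the index write, by falling back to comparing content.

```python
import os
from typing import NamedTuple


class CachedStat(NamedTuple):
    """A simplified stand-in for the per-file metadata kept in the index."""
    mtime_ns: int
    ctime_ns: int
    size: int
    ino: int
    mode: int


def snapshot(path: str) -> CachedStat:
    """Record the lstat fields used for the cheap comparison."""
    st = os.lstat(path)  # lstat does not follow symlinks
    return CachedStat(st.st_mtime_ns, st.st_ctime_ns, st.st_size,
                      st.st_ino, st.st_mode)


def maybe_changed(path: str, cached: CachedStat) -> bool:
    """Return True if the file might have changed since `cached` was taken.

    No file content is read here; only one lstat call is made. A True
    result means "needs a closer look", not "definitely different".
    """
    try:
        return snapshot(path) != cached
    except FileNotFoundError:
        return True  # the file was deleted
```

Only files for which this check reports a possible change need to be read and hashed, which is why the common case of a mostly untouched tree is cheap.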

Josh Lee answered Oct 02 '22 14:10
