This is about the internals of git.
I've been reading the great 'Pro Git' book and learning a little about how git works internally (the SHA1 hashes, blobs, references, trees, commits, and so on). Pretty clever architecture, by the way.
So, to put this into context: git references the content of a file by its SHA1 value, so it can tell whether a given piece of content has changed just by comparing hash values. But my question is specifically about how git checks whether the content in the working tree has changed.
The naive approach would be to think that each time you run a command such as git status, it walks through all the files in the working directory, calculating each SHA1 and comparing it with the one in the last commit. But that seems very inefficient for big projects such as the Linux kernel.
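To make the naive approach concrete, here is a minimal Python sketch. It assumes a SHA1 repository and ignores any content filters or line-ending conversion; the function names git_blob_sha1 and naive_is_modified are made up for illustration and are not part of git.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA1 git assigns to a blob holding `content`.

    Git hashes a short header ("blob <size>\\0") followed by the raw bytes.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def naive_is_modified(path: str, committed_sha1: str) -> bool:
    """Naive check (hypothetical helper): re-read and re-hash the whole file."""
    with open(path, "rb") as f:
        return git_blob_sha1(f.read()) != committed_sha1
```

Every call reads the full file, so running this over every tracked file costs time proportional to the total size of the working tree, which is exactly the inefficiency in question.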
Another idea would be to check the last modification date of each file, but I think git does not store that information (when you clone a repository, all the files get a new timestamp).
I'm sure it does this in an efficient way (git is really fast). Does anyone know how that is achieved?
PS: Just to add an interesting link about the git index, which specifically states that the index keeps information about file timestamps even though the tree objects do not.
In Git, there are four diff algorithms, namely Myers, Minimal, Patience, and Histogram, which are used to compute the differences between two versions of the same file in two different commits. The Minimal and Histogram algorithms are improved versions of Myers and Patience respectively.
Indexing. For every tracked file, Git records information such as its size, ctime (the time its metadata last changed, not its creation time) and last modification time in a file known as the index. To determine whether a file has changed, Git compares its current stats with those cached in the index. If they match, Git can skip reading the file again.
Diffing is a function that takes two input data sets and outputs the changes between them. git diff is a multi-use Git command that when executed runs a diff function on Git data sources. These data sources can be commits, branches, files and more.
Git’s index maintains timestamps of when git last wrote each file into the working tree (and updates these whenever files are cached from the working tree or from a commit). You can see the metadata with git ls-files --debug. In addition to the timestamp, it records the size, inode, and other information from lstat to reduce the chance of a false positive.
When you run git status, it simply calls lstat on every tracked file in the working tree and compares the metadata with what the index recorded, in order to quickly determine which files are unchanged. Only files whose metadata differs need to have their content examined. This is described in the documentation under racy-git and update-index.
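The stat comparison can be sketched in Python roughly as follows. This is a simplified illustration, not git's actual code: the CachedStat record, snapshot and is_modified are hypothetical names, the set of fields is an assumption based on the description above, and the sketch ignores the "racy git" case in which a file is modified within the same timestamp tick in which the index was written.

```python
import hashlib
import os
from typing import NamedTuple


class CachedStat(NamedTuple):
    """Hypothetical stand-in for the per-file data kept in the index."""
    mtime_ns: int
    ctime_ns: int
    size: int
    inode: int
    mode: int
    sha1: str


def blob_sha1(content: bytes) -> str:
    # Same hashing scheme git uses for blobs: header + raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def snapshot(path: str) -> CachedStat:
    """Record lstat metadata and the content hash, as if staging the file."""
    st = os.lstat(path)
    with open(path, "rb") as f:
        sha1 = blob_sha1(f.read())
    return CachedStat(st.st_mtime_ns, st.st_ctime_ns, st.st_size,
                      st.st_ino, st.st_mode, sha1)


def is_modified(path: str, cached: CachedStat) -> bool:
    """Cheap check first: only read the file when the metadata differs."""
    st = os.lstat(path)
    current = (st.st_mtime_ns, st.st_ctime_ns, st.st_size,
               st.st_ino, st.st_mode)
    if current == cached[:5]:
        return False  # metadata matches: assume unchanged, no read needed
    # Metadata differs: fall back to hashing the content to be sure.
    with open(path, "rb") as f:
        return blob_sha1(f.read()) != cached.sha1
```

In the common case where nothing has changed, each file costs a single lstat call and no reads, which is why the check stays fast even on very large trees.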