I have now heard this mentioned a couple of times, that Git provides data integrity. But what does that mean?
I understand that all objects in Git are accessed using a SHA-1 checksum and that this checksum is computed from the content of the file. This means that if the file changes, you will get a different checksum.
But how does that provide data integrity? If I look up some data based on a checksum (key), will Git return an error if it's not found (if it has somehow become corrupted)? I assume that data can still become corrupted when using Git: disk read errors, etc.
I don't really see the difference from, e.g., SVN here, or how data integrity is provided in practice in Git.
Git Has Integrity: Everything in Git is checksummed before it is stored and is then referred to by that checksum. This means it's impossible to change the contents of any file or directory without Git knowing about it.
Checksums in Git: In fact, the checksum is used as the commit identifier and is commonly referred to as "the SHA". Git's checksums include metadata about the commit, including the author, date, and the previous commit's SHA. Git assures the integrity of the data being stored by using checksums as identifiers.
Unlike systems that store a set of files plus the changes made to each file over time, Git thinks of its data more like a set of snapshots of a mini filesystem. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot.
If I look up some data based on a checksum (key), will Git return an error if it's not found (if it has somehow become corrupted)?
Essentially, yes. Suppose that the original correct data checksums to 1234. Git stores this checksum and looks up the data by that checksum. (This is how its "content addressable" thing works: one generally starts with, e.g., a branch name like master, which maps to a commit ID like 56789ab...; this mapping is kept in git's "refs", which are more vulnerable than the rest of the data, but let's assume for the moment that this part remains intact.)
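To make the "look up by checksum, then re-check" idea concrete, here is a minimal Python sketch. It assumes the object ID is the SHA-1 of a small header (type, size, NUL byte) followed by the content, which is how Git hashes its objects; the dict named store and the helper read_checked are hypothetical stand-ins for the object database, not Git's actual code.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 ID for content, hashed with a Git-style header."""
    # The header is "<type> <size in bytes>" followed by a NUL byte.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def read_checked(store: dict, object_id: str, obj_type: str = "blob") -> bytes:
    """Fetch content by ID and raise if it no longer hashes to that ID."""
    if object_id not in store:
        raise KeyError(f"object {object_id} is missing")
    content = store[object_id]
    if git_object_id(obj_type, content) != object_id:
        raise ValueError(f"object {object_id} is corrupt")
    return content


if __name__ == "__main__":
    store = {}  # hypothetical stand-in for the object database
    oid = git_object_id("blob", b"hello\n")
    store[oid] = b"hello\n"
    print(read_checked(store, oid))  # content comes back unchanged

    store[oid] = b"hellp\n"  # simulate a flipped byte on disk
    try:
        read_checked(store, oid)
    except ValueError as err:
        print("detected:", err)
```

The key never changes when the stored bytes are damaged, so the mismatch between key and recomputed checksum is what exposes the corruption.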
Git then extracts the commit by ID, and compares the checksum of the contents to the ID. This must match, or the commit contents are corrupted. Assuming the contents are valid, they contain a (single) tree ID (plus information about the commit: who made it, when, its parents, and so on).
Git then extracts the tree contents by ID, and compares the checksum of the contents to the ID. This must match, or the tree contents are corrupted. Assuming the contents are valid, they contain a series of tuples giving file modes, names, and IDs. For each tuple, the mode distinguishes between additional trees and plain files ("blobs"). The name is the name of the sub-tree or file, and the ID is the checksum of its contents.
Git then extracts the sub-tree or blob contents by ID, and compares the checksum. This must match, or the contents are corrupted. Assuming the contents are valid, a sub-tree is handled recursively as before, and a file is correct (not compromised).
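The commit, tree, sub-tree, blob walk described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Git's implementation: the store is an in-memory dict, and trees use a simplified text format ("mode id name" per line) instead of Git's real binary tree encoding. All function names are hypothetical.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over a Git-style header plus the object body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def put(store: dict, obj_type: str, body: bytes) -> str:
    """Store an object under its own checksum and return that ID."""
    oid = object_id(obj_type, body)
    store[oid] = (obj_type, body)
    return oid


def make_tree(entries) -> bytes:
    """Simplified text tree: one 'mode id name' line per entry (assumption)."""
    return "\n".join(f"{mode} {oid} {name}" for mode, oid, name in entries).encode()


def load(store: dict, oid: str, expected_type: str) -> bytes:
    """Extract an object by ID and compare its checksum to the ID."""
    if oid not in store:
        raise LookupError(f"{expected_type} {oid} is missing")
    obj_type, body = store[oid]
    if obj_type != expected_type or object_id(obj_type, body) != oid:
        raise ValueError(f"{expected_type} {oid} is corrupt")
    return body


def verify_tree(store: dict, tree_id: str) -> None:
    """Check a tree, then recurse into sub-trees and check each blob."""
    for line in load(store, tree_id, "tree").decode().splitlines():
        mode, oid, _name = line.split(" ", 2)
        if mode == "40000":  # mode used here to mark a sub-tree
            verify_tree(store, oid)
        else:
            load(store, oid, "blob")


def verify_commit(store: dict, commit_id: str) -> None:
    """Check the commit, then follow its single tree ID."""
    first_line = load(store, commit_id, "commit").decode().splitlines()[0]
    verify_tree(store, first_line.split(" ")[1])


if __name__ == "__main__":
    store = {}
    blob = put(store, "blob", b"print('hi')\n")
    sub = put(store, "tree", make_tree([("100644", blob, "main.py")]))
    root = put(store, "tree", make_tree([("40000", sub, "src")]))
    commit = put(store, "commit", f"tree {root}\n\ninitial\n".encode())
    verify_commit(store, commit)  # passes silently while everything is intact
    store[blob] = ("blob", b"print('ho')\n")  # corrupt one file
    try:
        verify_commit(store, commit)
    except ValueError as err:
        print("detected:", err)
```

Note that the failure surfaces only when the walk reaches the damaged blob; the commit and both trees still verify, which matches the point that a late failure pins the blame on the data rather than on the references.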
Note that along the way, any caught error simply tells you that something has gone wrong, but it does not correct the problem; for that, you need a backup (such as another copy of the repository). If the failure occurs fairly far along the process, it's clear that it's the data that are corrupt, since the checksums were valid long enough to find a commit and a tree and perhaps several sub-trees before the failure.
If the references are corrupted, they are hard to reconstruct. However, git can walk every object in the database and see if any are "unreferenced". Such objects are candidates for where the corrupted references should point. Actually fixing this, in practice, is usually pointlessly hard: you simply go to the same backup you would use in the case of a corrupted blob.
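As a rough sketch of that "find unreferenced objects" walk, the following Python models the object database as a toy graph: a dict mapping each object ID to the IDs it points at. The IDs (c1, t1, ...) and function names are made up for illustration; real Git has to parse the objects to find these links.

```python
def reachable_from(objects: dict, starts) -> set:
    """Collect every object reachable from the given starting IDs."""
    seen = set()
    stack = list(starts)
    while stack:
        oid = stack.pop()
        if oid in seen or oid not in objects:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


def unreferenced_tips(objects: dict, refs: dict) -> set:
    """Objects no ref reaches and no other unreachable object points at."""
    unreachable = set(objects) - reachable_from(objects, refs.values())
    pointed_to = {child for oid in unreachable for child in objects[oid]}
    return unreachable - pointed_to


if __name__ == "__main__":
    # Toy history: commit c2 has parent c1; each commit has a tree of blobs.
    objects = {
        "c2": ["c1", "t2"],
        "c1": ["t1"],
        "t1": ["b1"],
        "t2": ["b1", "b2"],
        "b1": [],
        "b2": [],
    }
    print(unreferenced_tips(objects, {"master": "c2"}))  # nothing is lost
    print(unreferenced_tips(objects, {}))  # refs lost: c2 is the candidate
```

With the refs gone, only c2 has nothing pointing at it, so it is the natural candidate for where the lost branch should point.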
But how does that provide data integrity? If I look up some data based on a checksum (key), will Git return an error if it's not found (if it has somehow become corrupted)?
Yes.
And by data, I don't mean just the content of a file, but the full history: if the parent version of a given version of a file changes, the SHA-1 of every later commit that builds on it changes too. The data is still there, but other parts of its history have changed.
I assume that data can still become corrupted when using git - disk read errors etc.
That is one example of corruption.
But even if the data remains intact, the integrity guarantee also covers any other change, even in metadata such as the author or committer name or date: change one of those and the SHA-1 changes as well.
That is because of the DAG (directed acyclic graph) of objects that makes up a Git repo:
(image from "Git Internals - Git Objects")
If you modify any of those items, the checksums of the objects that reference them (their parents in the graph) change as well.
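A small Python sketch can show the ripple effect. It assumes a commit body laid out as tree, parent, author, committer, blank line, message, hashed with a Git-style header; the author strings, timestamps, and helper names are invented for illustration, and both commits reuse an empty tree to keep the example short.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over a Git-style header plus the object body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree_id, parent_id, author, message) -> bytes:
    """Build a commit body: tree, optional parent, author, committer, message."""
    lines = [f"tree {tree_id}"]
    if parent_id:
        lines.append(f"parent {parent_id}")
    lines += [f"author {author}", f"committer {author}", "", message]
    return ("\n".join(lines) + "\n").encode()


def build_chain(root_author: str):
    """Return (root_id, child_id) for a two-commit history."""
    tree = object_id("tree", b"")  # empty tree, reused by both commits
    root = object_id("commit", commit_body(tree, None, root_author, "first"))
    child_author = "Bob <bob@example.com> 1700000100 +0000"  # hypothetical
    child = object_id("commit", commit_body(tree, root, child_author, "second"))
    return root, child


if __name__ == "__main__":
    before = build_chain("Alice <alice@example.com> 1700000000 +0000")
    # Change only the timestamp in the root commit's author line.
    after = build_chain("Alice <alice@example.com> 1700000001 +0000")
    print("root changed: ", before[0] != after[0])
    print("child changed:", before[1] != after[1])
```

The child commit's own fields are identical in both runs; its ID changes only because it embeds the root commit's ID as its parent.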