Data integrity in Git?

Tags:

git

I have now heard it mentioned a couple of times that Git provides data integrity. But what does that mean?

I understand that all objects in Git are accessed using a SHA-1 checksum and that this checksum is computed from the content of the file. This means that if the file changes, you get a different checksum.

But how does that provide data integrity? If I look up some data based on a checksum (key), will Git return an error if it's not found (if it has somehow become corrupted)? I assume that data can still become corrupted when using Git (disk read errors, etc.).

I don't really see the difference from, e.g., SVN here, or how data integrity is provided in practice in Git.

asked Dec 12 '14 09:12

u123

2 Answers

If I look up some data based on a checksum (key), will Git return an error if it's not found (if it has somehow become corrupted)?

Essentially, yes. Suppose that the original correct data checksums to 1234. Git stores this checksum and looks up the data by that checksum. (This is how its "content addressable" thing works: one generally starts with, e.g., a branch name like master, which maps to a commit ID like 56789ab.... This mapping is kept in git's "refs", which are more vulnerable than the rest of the data, but let's assume for the moment that this part remains intact.)
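As a rough illustration of the content-addressable lookup described above, here is a minimal Python sketch. The `git_object_id` function follows the way Git hashes a loose object (a `<type> <size>` header, a NUL byte, then the payload). The `ToyObjectStore` class is a hypothetical in-memory stand-in for `.git/objects`, not Git's actual storage code, which also compresses objects and uses pack files.

```python
import hashlib


def git_object_id(obj_type, payload):
    """Return the SHA-1 ID Git assigns to an object of this type and content."""
    # Git hashes a short header ("<type> <size>" plus a NUL byte) and then the payload.
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


class ToyObjectStore:
    """Hypothetical in-memory stand-in for .git/objects (illustration only)."""

    def __init__(self):
        self._objects = {}

    def put(self, obj_type, payload):
        # The key is derived from the content, never chosen by the caller.
        oid = git_object_id(obj_type, payload)
        self._objects[oid] = (obj_type, payload)
        return oid

    def get(self, oid):
        if oid not in self._objects:
            raise KeyError(f"object {oid} is missing")
        obj_type, payload = self._objects[oid]
        # Re-hash on read: if the stored bytes changed, the ID no longer matches.
        if git_object_id(obj_type, payload) != oid:
            raise ValueError(f"object {oid} is corrupt")
        return obj_type, payload


store = ToyObjectStore()
oid = store.put("blob", b"hello world\n")
print(oid)  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The printed ID should be the same one `git hash-object` reports for a file containing `hello world` and a newline, since it depends only on the bytes.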

Git then extracts the commit by ID, and compares the checksum of the contents to the ID. This must match, or the commit contents are corrupted. Assuming the contents are valid, they contain a (single) tree ID (plus information about the commit: who made it, when, its parents, and so on).

Git then extracts the tree contents by ID, and compares the checksum of the contents to the ID. This must match, or the tree contents are corrupted. Assuming the contents are valid, they contain a series of tuples giving file modes, names, and IDs. For each tuple, the mode distinguishes between additional trees and plain files ("blobs"). The name is the name of the sub-tree or file, and the ID is the checksum of the contents.

Git then extracts the sub-tree or blob contents by ID, and compares the checksum. This must match, or the contents are corrupted. Assuming the contents are valid, a sub-tree is handled recursively as before, and a file is correct (not compromised).
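The commit, tree, and blob checks above can be sketched as one recursive walk. This is a simplified model, not Git's implementation: objects live in a plain dictionary, parent commits are not followed, and the author identity is made up. The tree and commit encodings mimic Git's formats closely enough to show the idea.

```python
import hashlib

MODE_TREE = "40000"   # mode recorded for a sub-tree entry
MODE_FILE = "100644"  # mode recorded for a regular file


def object_id(obj_type, payload):
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def put(objects, obj_type, payload):
    oid = object_id(obj_type, payload)
    objects[oid] = (obj_type, payload)
    return oid


def encode_tree(entries):
    # Each entry is "<mode> <name>", a NUL byte, then the 20 raw bytes of the ID.
    return b"".join(
        f"{mode} {name}".encode() + b"\0" + bytes.fromhex(oid)
        for mode, name, oid in entries
    )


def decode_tree(payload):
    entries, pos = [], 0
    while pos < len(payload):
        space = payload.index(b" ", pos)
        nul = payload.index(b"\0", space)
        mode = payload[pos:space].decode()
        name = payload[space + 1:nul].decode()
        entries.append((mode, name, payload[nul + 1:nul + 21].hex()))
        pos = nul + 21
    return entries


def verify(objects, oid, path="<commit>"):
    """Return a list of problems found below this object (empty if none)."""
    if oid not in objects:
        return [f"{path}: object {oid} is missing"]
    obj_type, payload = objects[oid]
    if object_id(obj_type, payload) != oid:
        return [f"{path}: object {oid} is corrupt"]
    problems = []
    if obj_type == "commit":
        # The first line of a commit is "tree <id>".
        tree_oid = payload.split(b"\n", 1)[0].split(b" ")[1].decode()
        problems += verify(objects, tree_oid, "/")
    elif obj_type == "tree":
        for _mode, name, child in decode_tree(payload):
            problems += verify(objects, child, path.rstrip("/") + "/" + name)
    return problems


objects = {}
readme = put(objects, "blob", b"hello\n")
main = put(objects, "blob", b"print('hi')\n")
src = put(objects, "tree", encode_tree([(MODE_FILE, "main.py", main)]))
root = put(objects, "tree", encode_tree([(MODE_FILE, "README", readme),
                                         (MODE_TREE, "src", src)]))
ident = "A U Thor <author@example.com> 1418375520 +0000"  # hypothetical identity
commit = put(objects, "commit",
             f"tree {root}\nauthor {ident}\ncommitter {ident}\n\ninitial\n".encode())

clean = verify(objects, commit)            # nothing wrong yet
objects[main] = ("blob", b"print('ho')\n")  # simulate a flipped byte on disk
problems = verify(objects, commit)
print(clean, problems)
```

The walk reports where the damage is (here, the blob behind `/src/main.py`) but has no way to repair it.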

Note that along the way, any caught error simply tells you that something has gone wrong, but it does not correct the problem; for that, you need a backup (such as another copy of the repository). If the failure occurs fairly far along the process, it's clear that it's the data that are corrupt, since the checksums were valid long enough to find a commit and a tree and perhaps several sub-trees before the failure.

If the references are corrupted, they are hard to reconstruct. However, git can walk every object in the database and see if any are "unreferenced". Such objects are candidates for where the corrupted references should point. Actually fixing this, in practice, is usually pointlessly hard: you simply go to the same backup you would use in the case of a corrupted blob.
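The search for "unreferenced" objects can be sketched as a reachability walk. The model below is deliberately abstract: object IDs are short made-up labels rather than real SHA-1s, and each object just lists the IDs it points to. The helper names `reachable_from` and `dangling_commits` are hypothetical, chosen for this illustration.

```python
from collections import deque


def reachable_from(objects, start_ids):
    """Return every object ID reachable from the given starting IDs."""
    seen = set()
    queue = deque(oid for oid in start_ids if oid in objects)
    while queue:
        oid = queue.popleft()
        if oid in seen:
            continue
        seen.add(oid)
        _type, links = objects[oid]
        queue.extend(link for link in links if link in objects)
    return seen


def dangling_commits(objects, refs):
    """Commits that no ref reaches and that no other unreachable object points to."""
    unreachable = set(objects) - reachable_from(objects, refs.values())
    pointed_to = {link for oid in unreachable for link in objects[oid][1]}
    return sorted(
        oid for oid in unreachable
        if objects[oid][0] == "commit" and oid not in pointed_to
    )


# Toy repository: three commits in a line, each with its own tree and blob.
objects = {
    "c1": ("commit", ["t1"]),
    "c2": ("commit", ["t2", "c1"]),
    "c3": ("commit", ["t3", "c2"]),
    "t1": ("tree", ["b1"]),
    "t2": ("tree", ["b2"]),
    "t3": ("tree", ["b3"]),
    "b1": ("blob", []),
    "b2": ("blob", []),
    "b3": ("blob", []),
}

print(dangling_commits(objects, {"master": "c3"}))       # intact ref
print(dangling_commits(objects, {"master": "garbage"}))  # corrupted ref
```

With the ref intact nothing is left over; with the ref corrupted, the newest commit shows up as a candidate. In a real repository, abandoned commits would show up in the same list, which is part of why these are only candidates.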

answered Nov 15 '22 07:11

torek

But how does that provide data integrity? If I look up some data based on a checksum (key), will Git return an error if it's not found (if it has somehow become corrupted)?

Yes.

And by data, I don't mean just the content of a file, but the full history (i.e., if the parent version of a given version of a file changes, every SHA-1 in the Git repo that depends on it changes too). The data is still there, but other parts of its history have changed.

I assume that data can still become corrupted when using git - disk read errors etc.

That is one example of corruption.
But even if the data remains intact, integrity also covers any other change (even in metadata such as the author or committer name or date: change one of those and the SHA-1 changes as well).

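That is because of the DAG (directed acyclic graph) of objects that makes up a Git repo:

[image: Git data model diagram, http://git-scm.com/book/en/v2/book/10-git-internals/images/data-model-3.png]

(image from "Git Internals - Git Objects")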

If you modify any of those items, the objects that reference them (their parents in the graph) change as well.
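To make this propagation concrete, here is a small Python sketch that computes IDs for a two-commit history and then recomputes them after changing either a file or a date in the first commit. It is an illustration under simplifying assumptions: a single flat directory, a made-up author identity, and hand-built object encodings that imitate Git's rather than calling Git itself.

```python
import hashlib

EXAMPLE_IDENT = "A U Thor <author@example.com>"  # hypothetical identity


def object_id(obj_type, payload):
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def tree_id(files):
    """files maps a file name to its content (one flat directory only)."""
    payload = b""
    for name in sorted(files):
        blob = object_id("blob", files[name])
        payload += f"100644 {name}".encode() + b"\0" + bytes.fromhex(blob)
    return object_id("tree", payload)


def commit_id(tree, parents, timestamp, message):
    lines = [f"tree {tree}"]
    lines += [f"parent {parent}" for parent in parents]
    lines += [
        f"author {EXAMPLE_IDENT} {timestamp} +0000",
        f"committer {EXAMPLE_IDENT} {timestamp} +0000",
        "",
        message,
    ]
    return object_id("commit", "\n".join(lines).encode())


def two_commit_history(first_readme, first_timestamp):
    """Build commit1 from the arguments; commit2 always has the same content."""
    tree1 = tree_id({"README": first_readme})
    commit1 = commit_id(tree1, [], first_timestamp, "first\n")
    tree2 = tree_id({"README": b"final text\n"})
    commit2 = commit_id(tree2, [commit1], 1418375600, "second\n")
    return {"tree1": tree1, "commit1": commit1, "tree2": tree2, "commit2": commit2}


original = two_commit_history(b"hello\n", 1418375520)
edited_file = two_commit_history(b"hellO\n", 1418375520)  # one byte differs
edited_date = two_commit_history(b"hello\n", 1418375521)  # one second differs

for label, ids in (("file", edited_file), ("date", edited_date)):
    changed = sorted(key for key in ids if ids[key] != original[key])
    print(label, "edit changes:", changed)
```

In both cases the second commit gets a new ID even though its own tree is untouched, because it records the ID of its parent commit.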

answered Nov 15 '22 07:11

VonC
