
How does git assure that commit SHA keys for identical operations/data are still unique?

Tags: git, uuid, sha

If I create a file foo with touch foo and then run shasum foo it will print out

da39a3ee5e6b4b0d3255bfef95601890afd80709.

No matter how often I run shasum foo or if I run it on a different computer it will always print da39a3ee5e6b4b0d3255bfef95601890afd80709 because, yep, it's the SHA1 representation of exactly the same contents. Empty contents in this case :)
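The same digest can be reproduced outside the shell. This is a minimal sketch using Python's standard hashlib module; the variable name is only illustrative:

```python
import hashlib

# SHA-1 depends only on the bytes fed in, so hashing the empty input
# gives the same digest on every machine and on every run.
empty_digest = hashlib.sha1(b"").hexdigest()
print(empty_digest)  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```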

However, if I do the following steps:

cd /some/where
mkdir demo
cd demo
git init
touch foo
git add -A
git commit -m "adding foo"

..and remember the SHA key of the commit (e.g. 959c363ed4cf147725360532454bc258c964c744).

Now, when I delete demo and repeat the exact same steps, the commit SHA key will still be different. And that is great, and it's important to assure identity.

What I would like to know, though, is what exactly git does to assure that commit hashes are always unique even if they come from the exact same operations on exactly the same contents. Does git simply use something like uuidgen to generate a unique id for the commit object, or does it do something different based on a combination of a timestamp, your MAC address, your wifi signals etc. pp.?

Christoph asked Aug 04 '14 21:08



2 Answers

What I would like to know though is, what exactly does git do to assure commit hashes are always unique even if they do the exact same operations with exactly the same contents.

Nothing. If you create the same contents, you get the same SHA-1.

First, however, you need to realize what "same contents" of a commit means. Unless you hit an accidental SHA-1 collision[1] or find a way to break SHA-1, you must create the same complete repository history leading up to and including the commit itself, including all the same trees, author names, timestamps, and so on.

This is because the contents of a commit are what you see if you run git cat-file -p <sha-1> on a commit (plus the type-and-size header that says "this object is of type commit", so that there are no trivial ways to break things by creating a blob with the same contents as a previous commit). Here's one as an example:

$ git cat-file -p 996b0fdbb4ff63bfd880b3901f054139c95611cf
tree e760f781f2c997fd1d26f2779ac00d42ca93f534
parent 6da748a7cebe3911448fabf9426f81c9df9ec54f
parent 740c281d21ef5b27f6f1b942a4f2fc20f51e8c7e
author Junio C Hamano <[email protected]> 1406140600 -0700
committer Junio C Hamano <[email protected]> 1406140600 -0700

Sync with v2.0.3

* maint:
  Git 2.0.3
  .mailmap: combine Stefan Beller's emails
  git.1: switch homepage for stats

Note that this string includes the tree and its SHA-1, both of this commit's parent SHA-1s, the author and timestamp, the committer and timestamp, and the message. If you change even a single bit of this—such as by trying to change the underlying tree, or using some different parent commit(s)—you will get a new, different SHA-1, rather than 996b0fdbb4ff63bfd880b3901f054139c95611cf.
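To make the hashing rule concrete, here is a minimal Python sketch of how git names an object: SHA-1 over a "<type> <size>" header, a NUL byte, and then the raw contents. The helper name git_object_id and the author line are made up for this illustration; the two checked ids (the empty blob and the empty tree) are the well-known values git itself reports.

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    # Git hashes "<type> <size>", a NUL byte, then the raw content.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

# Well-known ids: the empty blob and the empty tree.
empty_blob = git_object_id("blob", b"")
empty_tree = git_object_id("tree", b"")
print(empty_blob)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(empty_tree)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# A commit body in the layout that `git cat-file -p` prints.
# The identity and message below are invented for this example.
body = (
    b"tree " + empty_tree.encode() + b"\n"
    b"author A U Thor <author@example.com> 1406140600 -0700\n"
    b"committer A U Thor <author@example.com> 1406140600 -0700\n"
    b"\n"
    b"empty commit\n"
)

as_commit = git_object_id("commit", body)
as_blob = git_object_id("blob", body)  # same bytes, different type header
changed = git_object_id("commit", body.replace(b"empty", b"Empty"))

print(as_commit != as_blob)  # the type header keeps blobs and commits apart
print(as_commit != changed)  # a one-character change gives a new id
```

Because the tree line embeds the tree's id and each parent line embeds a parent commit's id, any change further back in history propagates into every later commit id.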

So the answer to this:

So in theory if me and you do exactly the same steps at exactly the same time with exactly the same configured author, email etc, we would actually get the same commit SHA key?

is "yes". However ... you must start with the same staging area (this is what will become the tree), and the same parent commits. If you then configure your author, email, etc., exactly the same as the other guy, and both of you create a new commit at the same second (or use git's environment variables[2] to force the timestamps), you both get the same new commit.

Which is precisely what we want. It doesn't matter if you create it, when you're named "me", or I create it, when I'm named "me", if all the rest of the contents are the same. Because whoever creates it, the other "me" can clone it, and then we both have the same thing that way too.
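The scenario from the question can be simulated without git at all. The sketch below assumes a root commit whose tree holds a single empty file foo, and rebuilds the blob, tree, and commit objects by hand. The helper names, the identity "me <me@example.com>", and the timestamp are all made up for illustration.

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    # Git hashes "<type> <size>", a NUL byte, then the raw content.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

def root_commit_id(name: str, email: str, timestamp: int, tz: str,
                   message: str) -> str:
    """Id of a parentless commit whose tree holds one empty file 'foo'."""
    blob_id = git_object_id("blob", b"")
    # Tree entry: "<mode> <name>", a NUL byte, then the 20 raw id bytes.
    tree = b"100644 foo\x00" + bytes.fromhex(blob_id)
    tree_id = git_object_id("tree", tree)
    who = f"{name} <{email}> {timestamp} {tz}"
    body = f"tree {tree_id}\nauthor {who}\ncommitter {who}\n\n{message}\n"
    return git_object_id("commit", body.encode())

# Two people with identical settings committing in the same second.
mine = root_commit_id("me", "me@example.com", 1407186480, "+0000", "adding foo")
yours = root_commit_id("me", "me@example.com", 1407186480, "+0000", "adding foo")
# The same commit made one second later.
later = root_commit_id("me", "me@example.com", 1407186481, "+0000", "adding foo")

print(mine == yours)  # True
print(mine == later)  # False
```

In the question's experiment the only input that changed between the two runs was the timestamp, which is why the second commit got a different id.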

(If I want to be sure that the "me" that creates something is not confused with the real me, I need to add something unique, that I know and the other me doesn't. Of course, if I publish this thing somewhere, the other me now knows it. But this is what signed, annotated tags are for. They can contain a GPG signature.)


[1] The chance of an accidental hash collision for any single pair of objects is 1 out of 2^160, which is ... very small. :-) The chance rises rapidly with the number of objects, so that by the time you have a million objects, it's about 1 out of 2^121. The formula I use here is:

1 - exp(-(n * (n-1)) / (2 * r))

where r = 2^160 and n is the number of objects. Without the subtraction from 1, the equation calculates the "safety margin", as it were: the chance that we won't have an accidental hash collision. If we want to keep the collision chance in the same range as the chance that a disk drive reads back the wrong contents for a file (or at least, the figure that disk-makers claim), we need to keep it around 10^-18, which means we need to avoid putting more than about 1.7 quadrillion (1.7E15) objects in our git databases.
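The figures in this footnote can be checked numerically. This is a small sketch of the same approximation in Python; the function name is made up, and math.expm1 is used in place of a literal 1 - exp(...) only because the probabilities are far too small for ordinary floating-point subtraction to resolve.

```python
import math

R = 2 ** 160  # number of possible SHA-1 values

def collision_probability(n: int, r: int = R) -> float:
    """Approximate chance of at least one collision among n random hashes:
    1 - exp(-n(n-1) / (2r)), computed as -expm1(...) to keep precision."""
    return -math.expm1(-(n * (n - 1)) / (2 * r))

# One pair of objects: about 2^-160.
print(math.log2(collision_probability(2)))

# A million objects: about 2^-121.
print(math.log2(collision_probability(10 ** 6)))

# About 1.7 quadrillion objects: roughly 1e-18.
print(collision_probability(1_700_000_000_000_000))
```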

[2] There are many git environment variables that you can set to override various defaults. The ones for the author and committer, including date and email, are:

  • GIT_AUTHOR_NAME
  • GIT_AUTHOR_EMAIL
  • GIT_AUTHOR_DATE
  • GIT_COMMITTER_NAME
  • GIT_COMMITTER_EMAIL
  • GIT_COMMITTER_DATE
  • EMAIL

as described in the git commit-tree documentation.
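As a rough end-to-end check, the sketch below repeats the question's steps twice in throwaway directories with these variables pinned, then compares the two commit ids. It assumes a git executable is on the PATH (and does nothing otherwise). The identity, the date, and the helper name are made up for this example.

```python
import os
import shutil
import subprocess
import tempfile

# Fixed identity and dates; the values are invented for this example.
FIXED_ENV = {
    "GIT_AUTHOR_NAME": "me",
    "GIT_AUTHOR_EMAIL": "me@example.com",
    "GIT_AUTHOR_DATE": "1407186480 +0000",
    "GIT_COMMITTER_NAME": "me",
    "GIT_COMMITTER_EMAIL": "me@example.com",
    "GIT_COMMITTER_DATE": "1407186480 +0000",
}

def commit_foo_in_fresh_repo() -> str:
    """Repeat the question's steps in a temporary directory and
    return the id of the resulting commit."""
    env = {**os.environ, **FIXED_ENV}
    with tempfile.TemporaryDirectory() as repo:
        def git(*args: str) -> str:
            result = subprocess.run(
                ["git", *args], cwd=repo, env=env,
                check=True, capture_output=True, text=True,
            )
            return result.stdout.strip()

        git("init", "-q")
        open(os.path.join(repo, "foo"), "w").close()  # same as `touch foo`
        git("add", "-A")
        # Disable signing in case the local config enables it; a signature
        # would become part of the commit contents.
        git("-c", "commit.gpgsign=false", "commit", "-q", "-m", "adding foo")
        return git("rev-parse", "HEAD")

if shutil.which("git"):
    first = commit_foo_in_fresh_repo()
    second = commit_foo_in_fresh_repo()
    print(first == second)
else:
    first = second = None  # git is not installed; nothing to compare
```

With the dates left unpinned, the two runs would normally land in different seconds and produce different ids, which matches what the question observed.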

torek answered Oct 12 '22 02:10


It doesn't, but you will have to manually construct the commit to get the timestamps to line up. You can manually construct a whole valid repository identical to another, by editing the .git/objects files, but because newer commits contain the hashes of older commits this will of course have to be exactly identical.

U2EF1 answered Oct 12 '22 00:10