Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

what's the chance that two git commit have same `abbrev-commit`

Tags:

git

what's the chance that two git commit have same abbrev-commit

I see git history default show git abbrev-commit for simplification and beauty. But what's the chance that two same abbrev-commit appear in one git repo ?

like image 965
JerryZhou Avatar asked Dec 06 '22 09:12

JerryZhou


2 Answers

There are several different ways to answer the question. One uses math to tell you what the chance would be under various proposed conditions. Another is to ask what Git actually does, but when asking that question, the answer depends on your specific Git version.

The math answer

The chance depends on the length of the abbreviation and the number of objects in your repository. (In some cases, if you know the desired object type, you can disambiguate collisions if the potential matches describe different object types. In this case you can simply reduce the value n in the formula below.)

Since StackOverflow doesn't format LaTeX, I have a screenshot here from page 77 of my own (in-progress) book. I made this a bit too big—sorry:

snippet from p.77

To find the number you want, substitute in the correct value for n and r and evaluate p-bar, then subtract that from 1. N is the number of objects:

$ git count-objects -v
count: 49
size: 568
in-pack: 307916
packs: 40
size-pack: 176024
prune-packable: 0
garbage: 0
size-garbage: 0

This repository has about 300,000 objects (most being packed; there are only 49 loose objects), so n is about 300k. Your repository will of course differ.

Then, plug in the correct value for r. The value for r if you use a full hash is 2160, or 1461501637330902918203684832716283019655932542976. If you abbreviate the hash to four characters—this is the minimum Git will accept as input—it's 216 or 65536, as each character supplies 4 bits. The full hash is 40 characters long, hence the 160 in the full-hash formula.

What Git actually does

If you use git rev-parse --short=number or git log --abbrev=number --abbrev-commit, it's on you to pick the length. If you did not supply a number, Git picks a number, using an inadequate formula.1 But it doesn't just use that number!

Modern Git checks whether the abbreviated hash is unique in the current database. This is not a probabilistic guess, it's just a literal test, performed in a loop:

length = <whatever>
loop {
    generate short hash using <length> characters
    is short hash unambiguous? if so, we're done - exit the loop
    increment length
}

so that there's no chance of a collision with the objects you have now.

Unfortunately, if you add one more object, the new one might collide with the abbreviated hash generated based on the old ones. Use the formula above to compute this probability, knowing that all the existing keys did not collide, plus the value of r implied by the length of the abbreviated hash. It's probably still pretty good, since even 4 characters gets you 1-out-of-65536. But note that it worsens rapidly as you add more objects.

This check-in-a-loop code was there when Linus Torvald's first bit of code went in to what became Git 2.11. I'm not sure how far back one has to go to where it doesn't happen, but it definitely didn't happen in some very old Git versions.


1As of Git 2.11, Git used the fact that for a large number n of keys, the 50 percent collision rate occurs at n = sqrt(r). Linus Torvalds added this code:

+       if (len < 16 && !status && (flags & GET_SHA1_AUTOMATIC)) {
+               unsigned int expect_collision = 1 << (len * 2);
+               if (ds.nrobjects > expect_collision) {
+                       default_automatic_abbrev = len+1;
+                       return SHORT_NAME_AMBIGUOUS;
+               }
+       }

in commit e6c587c733 for Git 2.11. It was subsequently improved in commit 8e3f52d778. But 50% is much too high a probability.

like image 143
torek Avatar answered Dec 08 '22 00:12

torek


Zero. Or rather: the same chances that you have a SHA1 conflict in your repo.

When a git command returns a list of abbreviated refs, if it finds that two abbreviations (prefixes of hash digests) are identical, it adds more characters from the full SHA1 hash to these particular refs until they are no longer identical.

like image 31
root Avatar answered Dec 07 '22 22:12

root