Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Git hash duplicates

Git allows to retrieve the hash of the commit with commands like:

git rev-parse HEAD

which gives 33b316c or

git rev-parse --short HEAD

which gives 33b316cbeeab3d69e79b9fb659414af4e7829a32 I know that long hashes in practice will never collide.

In practice, the short hashes are used much more often. I'd like to know what's the probability for the short ones to collide? Does git take any measures to overcome possible collisions (when for example using git checkout)?

like image 998
GA1 Avatar asked Mar 05 '23 01:03

GA1


1 Answers

I give a formula in my book—see pp. 78-79—but if you're looking for a simple one, the point at which the probability of some hash collision reaches about 50% in an n-bit hash is when you hash roughly 2n/2 keys. The SHA-1 hash itself is 160 bits, represented as 40 hexadecimal digits, each representing 4 of the 160 bits. Truncating that to 7 hexadecimal digits leaves 28 bits, so you will reach 50%-chance-of-collision at about 214 keys, or 16384 objects. If you constrain the objects to be only commits, that's a pretty decent number of commits, but Git places all objects—commits, trees, annotated tag objects, and blobs—in a single hash-indexed key-value store.

The probability of the hashes of any given pair of keys colliding is just 1 in 2n, i.e., 1 in 228 or 1 out of 268 million. The reason it increases so fast to 50%, as the number of keys grows, is known as the Birthday Paradox or birthday problem. 50% is of course far too scary; with 28 bits, if we want the overall probability to be below 0.1%, we should keep the number of objects below about 1230. By going to 32 bits (8 character abbrevations) we double this to about 2460, but that's still not very many objects.

By the time you have 16k objects in your store, you probably should use at least 10 hexadecimal digits, giving 240 possible hash values and a p-bar value of about .99987794... (about .019% chance of collisions). Nine hex digits gives only 236 hash values, producing a p-bar of .99804890... or 0.19% chance of collision, which I think is too high.

If you can restrict your ambiguous-matching code to only commits—or only commit-ish, which in Git means commits or annotated tags—the built in defaults work pretty well. (Git will in fact do this in a lot of cases.) But Git's internal code for computing the "right" abbreviation length is, at least in my opinion, far too care-free, too "loosey-goosey", as it uses the 50%-collision-probability square-root trick in contexts where the resulting hash might be used to identify any object.

(As noted in comments, internally Git always uses the full hashes. It's only at the not-Git / Git interface, e.g., git log <hash> or git show <hash> user-facing commands, that you can type in an abbreviated hash, or ask for an abbreviated output hash. Here Git will default to using the 50%-collision-probability number to compute how many characters to show, starting with an estimate of the number of objects in the database. If you're supplying the hash, you choose how much to supply. If you're asking Git to provide it, you can still choose how much, with --abbrev=number. Note that there's an absolute minimum of 4: git log abc won't treat abc as a hash ID, but git log abcd will treat abcd as an abbreviated a hash ID. There's also a very old default of 7 characters, from the Git 1.7-ish days.)

like image 55
torek Avatar answered Mar 29 '23 17:03

torek