Hash collision in git

What would actually happen if I had a hash collision while using git?

E.g. I manage to commit two files with the same sha1 checksum, would git notice it or corrupt one of the files?

Could git be improved to live with that, or would I have to change to a new hash algorithm?

(Please do not deflect this question by discussing how unlikely that is - Thanks)

Sec, asked May 03 '12


3 Answers

Picking atoms on 10 Moons

An SHA-1 hash is a 40 hex character string... that's 4 bits per character times 40... 160 bits. Now we know 10 bits is approximately 1000 (1024 to be exact), meaning that there are 1 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 different SHA-1 hashes... 10^48.
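The arithmetic above is easy to check directly. A minimal sketch (Python handles arbitrary-precision integers natively, so 2^160 needs no special library):

```python
# The size of the SHA-1 output space: 160 bits -> 2**160 possible hashes.
total_hashes = 2 ** 160

print(f"{total_hashes:.2e}")  # about 1.46e+48, i.e. on the order of 10**48
print(len(str(total_hashes))) # 49 digits, matching the long number above
```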

What is this equivalent of? Well, the Moon is made up of about 10^47 atoms. So if we have 10 Moons... and you randomly pick one atom on one of these moons... and then go ahead and pick a random atom on them again... then the likelihood that you'll pick the same atom twice is the likelihood that two given git commits will have the same SHA-1 hash.

Expanding on this we can ask the question...

How many commits do you need in a repository before you should start worrying about collisions?

This relates to so called "Birthday attacks", which in turn refers to the "Birthday Paradox" or "Birthday Problem", which states that when you pick randomly from a given set, you need surprisingly few picks before you are more likely than not to have picked something twice. But "surprisingly few" is a very relative term here.

Wikipedia has a table on the probability of Birthday Paradox collisions. There is no entry for a 40 character hash, but an interpolation of the entries for 32 and 48 characters lands us in the range of 5×10^22 git commits for a 0.1% probability of a collision. That is fifty thousand billion billion different commits, or fifty zettacommits, before you have reached even a 0.1% chance that you have a collision.
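Rather than interpolating from Wikipedia's table, you can compute the same figure from the standard birthday-bound approximation n ≈ √(2N·ln(1/(1−p))), where N is the size of the hash space and p the target collision probability. A quick sketch (the function name is my own, for illustration):

```python
import math

def picks_for_collision_probability(space_size: int, p: float) -> float:
    """Birthday-bound approximation: number of random picks from a space
    of size `space_size` before the collision probability reaches `p`."""
    return math.sqrt(2 * space_size * math.log(1 / (1 - p)))

N = 2 ** 160  # size of the SHA-1 output space
n = picks_for_collision_probability(N, 0.001)
print(f"{n:.1e}")  # roughly 5e+22 commits for a 0.1% collision chance
```

This agrees with the "fifty zettacommits" figure quoted above.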

The byte sum of the hashes alone for these commits would be more data than all the data generated on Earth for a year, which is to say you would need to churn out code faster than YouTube streams out video. Good luck with that. :D

The point of this is that unless someone is deliberately causing a collision, the probability of one happening at random is so staggeringly small that you can ignore the issue.

"But when a collision does occur, then what actually happens?"

Ok, suppose the improbable does happen, or suppose someone managed to tailor a deliberate SHA-1 hash collision. What happens then?

In that case there is an excellent answer where someone experimented with it. I will quote from that answer:

  1. If a blob already exists with the same hash, you will not get any warnings at all. Everything seems to be ok, but when you push, someone clones, or you revert, you will lose the latest version (in line with what is explained above).
  2. If a tree object already exists and you make a blob with the same hash: Everything will seem normal, until you either try to push or someone clones your repository. Then you will see that the repo is corrupt.
  3. If a commit object already exists and you make a blob with the same hash: same as #2 - corrupt
  4. If a blob already exists and you make a commit object with the same hash, it will fail when updating the "ref".
  5. If a blob already exists and you make a tree object with the same hash. It will fail when creating the commit.
  6. If a tree object already exists and you make a commit object with the same hash, it will fail when updating the "ref".
  7. If a tree object already exists and you make a tree object with the same hash, everything will seem ok. But when you commit, all of the repository will reference the wrong tree.
  8. If a commit object already exists and you make a commit object with the same hash, everything will seem ok. But when you commit, the commit will never be created, and the HEAD pointer will be moved to an old commit.
  9. If a commit object already exists and you make a tree object with the same hash, it will fail when creating the commit.

As you can see some cases are not good. Especially cases #2 and #3 mess up your repository. However, it does seem that the fault stays within that repository, and the attack or bizarre improbability does not propagate to other repositories.
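These failure modes all stem from how git names objects: a blob's id is the SHA-1 of a "blob <size>\0" header followed by the raw content, so identical content always maps to a single stored object, and a colliding blob would silently deduplicate against the existing one (case #1). A minimal sketch of that naming scheme (the helper name is mine, not git's):

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Compute the object id git assigns to a blob: SHA-1 over the
    header 'blob <size>\\0' followed by the raw content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

a = git_blob_sha1(b"hello world\n")
b = git_blob_sha1(b"hello world\n")
c = git_blob_sha1(b"hello world!\n")

print(a)           # same id `git hash-object` would print for this content
print(a == b)      # True  -> identical content, one stored object
print(a == c)      # False -> any content change yields a different id
```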

Also, it seems that the issue of deliberate collisions is being recognised as a real threat, and so for instance GitHub is taking measures to prevent it.

MichaelK, answered Oct 08 '22

If two files have the same hash sum in git, it would treat those files as identical. In the extremely unlikely case this happens, you could always go back one commit and change something in the file so the hashes no longer collide ...

See Linus Torvalds' post in the thread “Starting to think about sha-256?” in the git mailing list.

klaustopher, answered Oct 08 '22

It's not really possible to answer this question with the right "but" without also explaining why it's not a problem. It's not possible to do that without really having a good grip on what a hash really is. It's more complicated than the simple cases you might have been exposed to in a CS program.

There is a basic misunderstanding of information theory here. If you reduce a large amount of information into a smaller amount by discarding some of it (i.e. a hash), there will be a chance of collision directly related to the length of the hash: the shorter the hash, the more likely a collision. Now, the vast majority of a hash's colliding inputs are gibberish, and you would never check in gibberish... even a binary image is somewhat structured. So in practice the chances are remote.

To answer your question: yes, git will treat two colliding files as the same. Changing the hash algorithm won't help fundamentally; it would take a "second check" of some sort, but ultimately you would need as much "additional check" data as the data itself to be 100% sure. With a simple check like you describe, you would already be 99.99999... (to a really long number of digits) percent sure.

SHA-x are cryptographically strong hashes, which means it's generally hard to intentionally create two source data sets that are both VERY SIMILAR to each other and have the same hash. One bit of change in the data should create more than one (preferably as many as possible) bits of change in the hash output. This also means it's very difficult (but not quite impossible) to work back from the hash to the complete set of collisions, and thereby pull out the original message from that set: all but a few will be gibberish, and of the ones that aren't, there is still a huge number to sift through if the message is of any significant length. The downside of a cryptographic hash is that it is slow to compute... in general.
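The "one bit of change in, many bits of change out" property (the avalanche effect) is easy to observe. This sketch flips a single input bit and counts how many of SHA-1's 160 output bits differ; for a strong hash the count should hover around half:

```python
import hashlib

def sha1_bits(data: bytes) -> int:
    """SHA-1 digest interpreted as a 160-bit integer."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

msg = b"git is a content-addressable filesystem"
flipped = bytes([msg[0] ^ 0x01]) + msg[1:]  # flip one bit of the input

# XOR the two digests and count differing bits.
changed_bits = bin(sha1_bits(msg) ^ sha1_bits(flipped)).count("1")
print(changed_bits)  # typically around 80 of the 160 output bits
```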

So, what does it all mean for Git? Not much. Hashes are computed so rarely (relative to everything else) that their computational penalty is low overall. The chance of hitting a pair of collisions is so low that it's not realistic for one to occur and go undetected (i.e. your code would most likely suddenly stop building), which lets the user fix the problem: back up a revision, make the change again, and you'll almost certainly get a different hash, since the timestamp also feeds the hash in git. Collisions are more likely to be a real problem if you're storing arbitrary binaries in git, which isn't really its primary use model. If you want to do that, you're probably better off using a traditional database.

It's not wrong to think about this - it's a good question that a lot of people just pass off as "so unlikely it's not worth thinking about" - but it's really a little more complicated than that. If it DOES happen, it should be very readily detectable; it won't be a silent corruption in a normal workflow.

Steve, answered Oct 08 '22