Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Probability of collision with truncated SHA-256 hash

I have a database-driven web application where the primary keys of all data rows are obfuscated as follows: SHA256(content type + primary key + secret), truncated to the first 8 characters. The content type is a simple word, e.g. "post" or "message" and the secret is a 20-30 char ASCII constant. The result is stored in a separate indexed column for fast DB lookup.

How do I calculate the probability of a hash collision in this scenario? I am not a mathematician at all, but a friend claimed that due to the Birthday Paradox the collision probability would be ~1% for 10,000 rows with an 8-char truncation. Is there any truth to this claim?

like image 690
CoderMD666 Avatar asked Nov 13 '13 19:11

CoderMD666


People also ask

What is the probability of SHA256 collision?

SHA256: The slowest, usually 60% slower than md5, and the longest generated hash (32 bytes). The probability of just two hashes accidentally colliding is approximately: 4.3*10-60. As you can see, the slower and longer the hash is, the more reliable it is.

Is it safe to truncate SHA256?

Like Tom just said, you can truncate SHA-256 output to 128 bits for integrity, because 128 bits is enough to reasonably avoid collisions. However, hash functions like SHA-256 are not suited for key generation or file authenticity (only integrity). If you want to generate a key, use something like PKBDF2 or scrypt.

Can Sha-256 have collisions?

In [8] Li et al. have shown how a particular preimage attack can be used to construct a free-start collision attack for 52 steps of SHA-256. All attacks are only slightly faster than the generic attack which has a complexity of 2256 for preimages and 2128 for free- start collisions.

What happens if SHA256 is broken?

in this scenario sha256-based cryptocurrencies will be worthless. in general: every cryptocurrency and every encryption-system will be worthless when the underlying algorithm (sha2, sha3, aes, ripemd160, whatever) is "broken" by a quantum commputer.


1 Answers

Yes, there is a collision probability & it's probably somewhat too high. The exact probability depends on what "8 characters" means.

Does "8 characters" mean:

  • A) You store 8 hex characters of the hash? That would store 32 bits.
  • B) You store 8 characters of BASE-64? That would store 48 bits.
  • C) You store 8 bytes, encoded in some single-byte charset/ or hacked in some broken way into a character encoding? That would store 56-64 bits, but if you don't do encoding right you'll encounter character conversion problems.
  • D) You store 8 bytes, as bytes? That genuinely stores 64 bits of the hash.

Storing binary data as either A) hex or D) binary bytes, would be my preferred options. But I'd definitely recommend either reconsidering your "key obfuscation" scheme or significantly expanding the stored key-size to reduce the (currently excessive) probability of key collision.

From Wikipedia: http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem

The birthday problem in this more generic sense applies to hash functions: the expected number of N-bit hashes that can be generated before getting a collision is not 2^N, but rather only 2^(N/2).

Since in the most conservative above understanding of your design (reading it as A, 8 chars of hex == 32 bits) your scheme would be expected to suffer collisions if it stored on the scale of ~64,000 rows. I would consider such an outcome unacceptable for all serious, or even toy, systems.

Transaction tables may have volumes, allowing growth for the business, from 1000 - 100,000 transactions/day (or more). Systems should be designed to function 100 years (36500 days), with a 10x growth factor built in, so..

For your keying mechanism to be genuinely robust & professionally useful, you would need to be able to scale it up to potentially handle ~36 billion (2^35) rows without collision. That would imply 70+ bits of hash.

The source-control system Git, for example, stores 160 bits of SHA-1 hash (40 chars of hex == 20 bytes or 160 bits). Collisions would not be expected to be probable with < less than 2^80 different file revisions stored.


A possibility better design might be, rather than hashing & pseudo-randomizing the key entirely & hoping (against hope) to avoid collisions, to prepend/ append/ fold-in 8-10 bits of a hash into the key.

This would generates a larger key, containing all the uniqueness of the original key plus 8-10 bits of verification. Attempts to access keys would then be verified, and more than 3 invalid requests would be treated as an attempt to violate security by "probing" the keyspace & would trigger semi-permanent lockout.

The only major costs here, would be a modest reduction in the size of the available keyspace for a given int-size. 32-bit int to/from the browser would have 8-10 bits dedicated to security, thus leaving 22-24 for the actual key. So you'd use 64-bit ints where that was not sufficient.

like image 112
Thomas W Avatar answered Sep 26 '22 03:09

Thomas W