I have a database-driven web application where the primary keys of all data rows are obfuscated as follows: SHA256(content type + primary key + secret), truncated to the first 8 characters. The content type is a simple word, e.g. "post" or "message" and the secret is a 20-30 char ASCII constant. The result is stored in a separate indexed column for fast DB lookup.
How do I calculate the probability of a hash collision in this scenario? I am not a mathematician at all, but a friend claimed that due to the Birthday Paradox the collision probability would be ~1% for 10,000 rows with an 8-char truncation. Is there any truth to this claim?
SHA256: The slowest, usually 60% slower than md5, and the longest generated hash (32 bytes). The probability of just two hashes accidentally colliding is approximately: 4.3*10-60. As you can see, the slower and longer the hash is, the more reliable it is.
Like Tom just said, you can truncate SHA-256 output to 128 bits for integrity, because 128 bits is enough to reasonably avoid collisions. However, hash functions like SHA-256 are not suited for key generation or file authenticity (only integrity). If you want to generate a key, use something like PKBDF2 or scrypt.
In [8] Li et al. have shown how a particular preimage attack can be used to construct a free-start collision attack for 52 steps of SHA-256. All attacks are only slightly faster than the generic attack which has a complexity of 2256 for preimages and 2128 for free- start collisions.
in this scenario sha256-based cryptocurrencies will be worthless. in general: every cryptocurrency and every encryption-system will be worthless when the underlying algorithm (sha2, sha3, aes, ripemd160, whatever) is "broken" by a quantum commputer.
Yes, there is a collision probability & it's probably somewhat too high. The exact probability depends on what "8 characters" means.
Does "8 characters" mean:
Storing binary data as either A) hex or D) binary bytes, would be my preferred options. But I'd definitely recommend either reconsidering your "key obfuscation" scheme or significantly expanding the stored key-size to reduce the (currently excessive) probability of key collision.
From Wikipedia: http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem
The birthday problem in this more generic sense applies to hash functions: the expected number of N-bit hashes that can be generated before getting a collision is not 2^N, but rather only 2^(N/2).
Since in the most conservative above understanding of your design (reading it as A, 8 chars of hex == 32 bits) your scheme would be expected to suffer collisions if it stored on the scale of ~64,000 rows. I would consider such an outcome unacceptable for all serious, or even toy, systems.
Transaction tables may have volumes, allowing growth for the business, from 1000 - 100,000 transactions/day (or more). Systems should be designed to function 100 years (36500 days), with a 10x growth factor built in, so..
For your keying mechanism to be genuinely robust & professionally useful, you would need to be able to scale it up to potentially handle ~36 billion (2^35) rows without collision. That would imply 70+ bits of hash.
The source-control system Git, for example, stores 160 bits of SHA-1 hash (40 chars of hex == 20 bytes or 160 bits). Collisions would not be expected to be probable with < less than 2^80 different file revisions stored.
A possibility better design might be, rather than hashing & pseudo-randomizing the key entirely & hoping (against hope) to avoid collisions, to prepend/ append/ fold-in 8-10 bits of a hash into the key.
This would generates a larger key, containing all the uniqueness of the original key plus 8-10 bits of verification. Attempts to access keys would then be verified, and more than 3 invalid requests would be treated as an attempt to violate security by "probing" the keyspace & would trigger semi-permanent lockout.
The only major costs here, would be a modest reduction in the size of the available keyspace for a given int-size. 32-bit int to/from the browser would have 8-10 bits dedicated to security, thus leaving 22-24 for the actual key. So you'd use 64-bit ints where that was not sufficient.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With