I am making an application that stores documents and gives each one a UID based on a SHA1 digest of a few things including the timestamp. The digest has a lot of characters, and I want to allow users to identify the documents by using the first x characters of the full digest. What's a good value for x if the number of documents is maybe around 10K - 100K?
The hash size for the SHA1 algorithm is 160 bits.
As far as truncating a hash goes, that's fine. It's explicitly endorsed by the NIST, and there are hash functions in the SHA-2 family that are simple truncated variants of their full brethren: SHA-256/224, SHA-512/224, SHA-512/256, and SHA-512/384, where SHA-x/y denotes a full-length SHA-x truncated to y bits.
Yes it is possible because of the pigeon hole principle. Most hashes (also sha1) have a fixed output length, while the input is of arbitrary size. So if you try long enough, you can find them. However, cryptographic hash functions (like the sha-family, the md-family, etc) are designed to minimize such collisions.
Therefore, there are infinitely many possible data that can be hashed. Note the definition of a hash above which states that a hash is always fixed-length. For example, the MD5 hash is always 128 bits long (commonly represented as 16 hexadecimal bytes). Thus, there are 2^128 possible MD5 hashes.
Adapting the formulas on on wikipedia for the Birthday problem, you can approximate the probability of collision as 1 - e^(-n^2/(2^(b+1)))
, where n
is the document count and b
is the number of bits. Graphing this formula with n=100,000, it looks like you'll want b > 45 at least. I'd be more inclined to go with 64 to make it a nice and round number. That said, do have a plan to deal with collisions if they occur (maybe alter the timestamp slightly, or add a nonce?)
For that matter, if the sha1 is based on more than just the content of the document, why not simply make it a random ID? In this case collisions are less of a problem, as you can always generate a new random number and try again (the probability of a collision with a single try is the same, however).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With