Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How much can you truncate a SHA1 hash and be reasonably sure of having an unique ID?

Tags:

I am making an application that stores documents and gives each one a UID based on a SHA1 digest of a few things including the timestamp. The digest has a lot of characters, and I want to allow users to identify the documents by using the first x characters of the full digest. What's a good value for x if the number of documents is maybe around 10K - 100K?

like image 968
dan Avatar asked Jan 24 '11 16:01

dan


People also ask

What is the length of a SHA1 hash?

The hash size for the SHA1 algorithm is 160 bits.

Is it safe to truncate a hash?

As far as truncating a hash goes, that's fine. It's explicitly endorsed by the NIST, and there are hash functions in the SHA-2 family that are simple truncated variants of their full brethren: SHA-256/224, SHA-512/224, SHA-512/256, and SHA-512/384, where SHA-x/y denotes a full-length SHA-x truncated to y bits.

Is SHA-1 always unique?

Yes it is possible because of the pigeon hole principle. Most hashes (also sha1) have a fixed output length, while the input is of arbitrary size. So if you try long enough, you can find them. However, cryptographic hash functions (like the sha-family, the md-family, etc) are designed to minimize such collisions.

How many unique hashes are there?

Therefore, there are infinitely many possible data that can be hashed. Note the definition of a hash above which states that a hash is always fixed-length. For example, the MD5 hash is always 128 bits long (commonly represented as 16 hexadecimal bytes). Thus, there are 2^128 possible MD5 hashes.


1 Answers

Adapting the formulas on on wikipedia for the Birthday problem, you can approximate the probability of collision as 1 - e^(-n^2/(2^(b+1))), where n is the document count and b is the number of bits. Graphing this formula with n=100,000, it looks like you'll want b > 45 at least. I'd be more inclined to go with 64 to make it a nice and round number. That said, do have a plan to deal with collisions if they occur (maybe alter the timestamp slightly, or add a nonce?)

For that matter, if the sha1 is based on more than just the content of the document, why not simply make it a random ID? In this case collisions are less of a problem, as you can always generate a new random number and try again (the probability of a collision with a single try is the same, however).

like image 173
bdonlan Avatar answered Sep 19 '22 23:09

bdonlan