1) For the purpose of really low hash collision, can I get away with just using half of the 128 bits of a sha1 rather than dealing with the sha1 itself? I understand this is not suitable for cryptographic hashes, but I just need the hashes for hash table keys.
2) Computation time isn't a priority, and besides which I'm hashing very small pieces of data. In particular, I'm mostly going to be taking 2 or 3 64-bit hashes and hashing them to get another 64-bit hash. Is there a better option than sha1 for this purpose? Again, collisions should be very unlikely.
3) I'm a sql newb. Is it a good idea to use 64-bit hashes as id's in sql? Will 64-bit id's cause performance problems in sqlite or postgres? I'm going to need to coordinate data across multiple databases (including a Lucene index), so I figured I should deal with hashes directly in the tables rather than bothering with auto-incremented ids (which would only be meaningful in one db, not across all data stores). I figure 64-bit is a good compromise: big enough for unlikely collisions but saves on space (and lookup time?).
4) What about CRC-64? Does that produce a random enough distribution?
If you have few enough records it's almost certain that you'll never have a hash collision in 64 bits. Likely you will fall into this category.
There should be no problem with trimming down a cryptographic hash like sha1, because if there were internal structure in the hash then it wouldn't be good enough to be a crypto hash, and if there's no structure then any subset of the bits should be quite random. Note that I'm only talking about using that for IDs, not for any crypto purposes!
But really, doesn't your SQL have some kind of GUID? And if it does, why not use it?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With