Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast hash for strings

Tags:

I have a set of ASCII strings, let's say they are file paths. They could be both short and quite long.

I'm looking for an algorithm that could calculate hash of such a strings and this hash will be also a string, but will have a fixed length, like youtube video ids:

https://www.youtube.com/watch?v=-F-3E8pyjFo                                 ^^^^^^^^^^^ 

MD5 seems to be what I need, but it is critical for me to have a short hash strings.

Is there a shell command or python library which can do that?

like image 775
Antonio Avatar asked Feb 24 '14 22:02

Antonio


People also ask

Which string hashing is best?

If you just want to have a good hash function, and cannot wait, djb2 is one of the best string hash functions i know. it has excellent distribution and speed on many different sets of keys and table sizes. you are not likely to do better with one of the "well known" functions such as PJW, K&R[1], etc.

Which hash function is fastest?

SHA-1 is fastest hashing function with ~587.9 ms per 1M operations for short strings and 881.7 ms per 1M for longer strings. MD5 is 7.6% slower than SHA-1 for short strings and 1.3% for longer strings.

Can you hash a string?

The process of hashing in cryptography is to map any string of any given length, to a string with a fixed length. This smaller, fixed length string is known as a hash. To create a hash from a string, the string must be passed into a hash function.

How do you hash a string in C++?

using c++11, you can: #include <string> #include <unordered_map> std::size_t h1 = std::hash<std::string>{}("MyString"); std::size_t h2 = std::hash<double>{}(3.14159);


1 Answers

Python has a built-in hash() function that's very fast and perfect for most uses:

>>> hash("dfds") 3591916071403198536 

You can then make it unsigned:

>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value 

You can then turn it into a 16 byte hex string:

>>> hashu("dfds").to_bytes(8,"big").hex() 

Or an N*2 byte string, where N is <= 8:

>>> hashn=lambda word, N  : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex() 

..etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so vastly faster, it's never worth using hashlib for anything unless you need security... not just collision resistance.

>>> hashnbig=lambda word, N  : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex() 

And finally, use the urlsafe base64 encoding to make a much better string than "hex" gives you

>>> hashnbigu=lambda word, N  : urlsafe_b64encode(((hashu(word)+2**64*hash(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=") >>> hashnbigu("foo",16) 'ZblnvrRqHwAy2lnvrR4HrA' 

Caveats:

  • Be warned that in Python 3.3 and up, this function is randomized and won't work for some use cases. You can disable this with PYTHONHASHSEED=0

  • See https://github.com/flier/pyfasthash for fast, stable hashes that won't break your CPU for non-cryptographic applications.

  • Don't use this lambda style in real code... write it out! And stuffing things like 2**32 in your code, instead of making them constants is bad form.

  • In the end 8 bytes of collision resistance is OK for a smaller applications.... with less than a million entries, you've got collision odds of < 0.0000001%. That's a 12 byte b64 encoded string. But it might not be enough for larger apps.

  • 16 bytes is enough for a UUID/OID in a cache, etc.

like image 92
Erik Aronesty Avatar answered Oct 05 '22 08:10

Erik Aronesty