Fast hash for strings

I have a set of ASCII strings, let's say they are file paths. They can be short or quite long.

I'm looking for an algorithm that calculates a hash of such strings, where the hash is also a string but has a fixed length, like YouTube video IDs:

https://www.youtube.com/watch?v=-F-3E8pyjFo
                                ^^^^^^^^^^^

MD5 seems to be what I need, but it is critical for me to have short hash strings.

Is there a shell command or Python library that can do that?

asked Feb 24 '14 22:02

Antonio

1 Answer

Python has a built-in hash() function that's very fast and perfect for most uses:

>>> hash("dfds")
3591916071403198536

You can then make it unsigned:

>>> import ctypes
>>> hashu = lambda word: ctypes.c_uint64(hash(word)).value
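
The unsigned step matters because hash() can return a negative integer, and int.to_bytes() rejects negative values unless signed=True is passed. A small sketch of what ctypes.c_uint64 does to a negative value (the helper name to_unsigned64 is made up here for illustration):

```python
import ctypes

def to_unsigned64(value):
    """Reinterpret a signed 64-bit integer as unsigned (two's complement)."""
    return ctypes.c_uint64(value).value

# A negative value cannot be packed into unsigned bytes directly.
try:
    (-1).to_bytes(8, "big")
    packed_directly = True
except OverflowError:
    packed_directly = False

# After conversion, -1 maps to the top of the unsigned 64-bit range.
unsigned = to_unsigned64(-1)
hex_text = unsigned.to_bytes(8, "big").hex()
```

Non-negative values pass through unchanged, so the conversion only affects hashes that happen to come out negative.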

You can then turn it into a 16-character hex string (8 bytes):

>>> hashu("dfds").to_bytes(8,"big").hex()

Or an N*2-character string, where N is the number of bytes and N <= 8:

>>> hashn=lambda word, N  : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()

And so on. If you want N to be larger than 8 bytes, you can hash twice and combine the two results. Python's built-in is so much faster that hashlib is rarely worth using unless you need security, not just collision resistance (or values that stay stable across runs; see the caveats).

>>> hashnbig=lambda word, N  : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()

Finally, use the URL-safe base64 encoding to get a shorter string than hex gives you:

>>> from base64 import urlsafe_b64encode
>>> hashnbigu = lambda word, N: urlsafe_b64encode(((hashu(word) + 2**64 * hashu(word + "2")) % (2**(N*8))).to_bytes(N, "big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo", 16)
'ZblnvrRqHwAy2lnvrR4HrA'
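
For readability, the same idea can be written out as a regular function with named constants. This is only a sketch of the lambdas above, not a different algorithm; the names short_hash, HASH_BITS, HASH_MASK and MAX_BYTES are chosen here for illustration, and the bit mask is assumed to stand in for ctypes.c_uint64.

```python
from base64 import urlsafe_b64encode

HASH_BITS = 64                      # width each hash() value is widened to
HASH_MASK = (1 << HASH_BITS) - 1    # masking gives the same result as c_uint64
MAX_BYTES = 16                      # two 64-bit hashes combined

def short_hash(text, num_bytes=16):
    """Return a URL-safe string built from num_bytes of hash() output."""
    if not 1 <= num_bytes <= MAX_BYTES:
        raise ValueError("num_bytes must be between 1 and 16")
    low = hash(text) & HASH_MASK           # first hash, made unsigned
    high = hash(text + "2") & HASH_MASK    # second hash over a modified input
    combined = (low | (high << HASH_BITS)) % (1 << (num_bytes * 8))
    raw = combined.to_bytes(num_bytes, "big")
    # Strip the "=" padding so the result stays short and URL-friendly.
    return urlsafe_b64encode(raw).decode("ascii").rstrip("=")
```

Within one interpreter process short_hash("foo") always returns the same 22-character string; across separate runs it changes unless the hash seed is fixed.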

Caveats:

Be warned that in Python 3.3 and up, hash() of strings is randomized per process by default, so the values change between runs and can't be stored or shared between processes. You can disable this by setting PYTHONHASHSEED=0.
See https://github.com/flier/pyfasthash for fast, stable hashes that are light on CPU, for non-cryptographic applications.
Don't use this lambda style in real code... write it out! And stuffing things like 2**64 in your code instead of making them constants is bad form.
In the end, 8 bytes of collision resistance is OK for smaller applications: with fewer than a million entries, the odds of any collision are below 0.000003% (about 3 in 100 million). That's an 11-character base64 string once the padding is stripped. But it might not be enough for larger apps.
16 bytes is enough for a UUID/OID in a cache, etc.
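
To sanity-check collision figures like these, the usual birthday approximation can be evaluated directly. This is a rough sketch that assumes hash values behave like uniform random numbers, which is an idealisation; the function name collision_probability is chosen here for illustration.

```python
import math

def collision_probability(num_items, hash_bits):
    """Approximate chance that at least two of num_items share a hash value.

    Uses the birthday approximation 1 - exp(-n(n-1) / 2^(bits+1)),
    assuming uniformly distributed hash values.
    """
    pairs_over_space = num_items * (num_items - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs_over_space)

for bits in (64, 128):
    p = collision_probability(1_000_000, bits)
    print(f"{bits}-bit hash, 1e6 items: {p:.2e}")
```

For a million entries this gives roughly 2.7e-8 (about 0.000003%) with 8 bytes, and a vanishingly small value with 16 bytes.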

answered Oct 05 '22 08:10

Erik Aronesty
