Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to hash strings into a float in [0:1]?

I have a dataset with multiple strings. I want to associate each of these strings to a float, "randomly" distributed in the [0:1] range. Examples:

>>> myfunction(string_1)
0.26756754
>>> myfunction(string_2)
0.86764534

random does not fulfill my need because it does not take any string as input/deterministic parameter. I am looking for something more like a hash function.

like image 495
Vincent Avatar asked Oct 31 '16 22:10

Vincent


1 Answers

A fast and portable solution:

from zlib import crc32

def bytes_to_float(b):
    return float(crc32(b) & 0xffffffff) / 2**32

This converts a bytes string to a float between 0.0 and 1.0. If you are using a unicode string (e.g., in python 3), then you need to encode it:

def str_to_float(s, encoding="utf-8"):
    return bytes_to_float(s.encode(encoding))

Example

>>> str_to_float(u"café")
0.5963937465567142

This should give the same result on any machine and any version of python (tested on python 2.7 and 3.5).

Note: the & 0xffffffff is here to guarantee an unsigned int result. It is needed because depending on the python version crc32(b) may return a signed or unsigned int.

Edit

If you want something more "random" than a CRC32, you can use a hash function instead, such as SHA256:

from struct import unpack
from hashlib import sha256

def bytes_to_float(b):
    return float(unpack('L', sha256(b).digest()[:8])[0]) / 2**64

Performance test

            String length
Function    7       70      700     7000
b2f_crc32   0.34    0.38    0.87    5.59    
b2f_md5     0.96    1.08    2.11    11.13   
b2f_sha1    0.99    1.07    1.76    8.37    
b2f_sha256  1.11    1.20    2.60    16.44   
b2f_rnd     6.59    6.55    6.59    6.60    

Basically, the CRC32 solution is by far the fastest for short strings (18x faster than @user3030010's random=RND solution). It is roughly 3x faster than SHA256, no matter the string length. SHA256 is slower than MD5, which is slower than SHA1 (except for very short strings). However, the RND option does not depend on the string length, so when the strings are very long, it can be the fastest option (but see my comments on @user3030010's answer): on my computer, it beats SHA256 for strings longer than 2500 characters, and it beats CRC32 for strings longer than 8000 characters.

Here's the code, using timeit.timeit():

from __future__ import print_function
[...] # define b2f_crc32, b2f_md5 and so on.
for func in ("b2f_crc32", "b2f_md5", "b2f_sha1", "b2f_sha256", "b2f_rnd"):
  for length in (7, 70, 700, 7000):
    t = timeit('b2f(b"%s")'%(b"x"*length),
               'from __main__ import %s as b2f' % func)
    print("%.2f"%t, end="\t")
  print()
like image 167
MiniQuark Avatar answered Sep 28 '22 02:09

MiniQuark