I have a dataset with multiple strings. I want to associate each of these strings to a float, "randomly" distributed in the [0:1]
range. Examples:
>>> myfunction(string_1)
0.26756754
>>> myfunction(string_2)
0.86764534
random
does not fulfill my need because it does not take any string as input/deterministic parameter. I am looking for something more like a hash function.
A fast and portable solution:
from zlib import crc32
def bytes_to_float(b):
return float(crc32(b) & 0xffffffff) / 2**32
This converts a bytes string to a float between 0.0 and 1.0. If you are using a unicode string (e.g., in python 3), then you need to encode it:
def str_to_float(s, encoding="utf-8"):
return bytes_to_float(s.encode(encoding))
Example
>>> str_to_float(u"café")
0.5963937465567142
This should give the same result on any machine and any version of python (tested on python 2.7 and 3.5).
Note: the & 0xffffffff
is here to guarantee an unsigned int result. It is needed because depending on the python version crc32(b)
may return a signed or unsigned int.
Edit
If you want something more "random" than a CRC32, you can use a hash function instead, such as SHA256:
from struct import unpack
from hashlib import sha256
def bytes_to_float(b):
return float(unpack('L', sha256(b).digest()[:8])[0]) / 2**64
Performance test
String length
Function 7 70 700 7000
b2f_crc32 0.34 0.38 0.87 5.59
b2f_md5 0.96 1.08 2.11 11.13
b2f_sha1 0.99 1.07 1.76 8.37
b2f_sha256 1.11 1.20 2.60 16.44
b2f_rnd 6.59 6.55 6.59 6.60
Basically, the CRC32 solution is by far the fastest for short strings (18x faster than @user3030010's random=RND solution). It is roughly 3x faster than SHA256, no matter the string length. SHA256 is slower than MD5, which is slower than SHA1 (except for very short strings). However, the RND option does not depend on the string length, so when the strings are very long, it can be the fastest option (but see my comments on @user3030010's answer): on my computer, it beats SHA256 for strings longer than 2500 characters, and it beats CRC32 for strings longer than 8000 characters.
Here's the code, using timeit.timeit()
:
from __future__ import print_function
[...] # define b2f_crc32, b2f_md5 and so on.
for func in ("b2f_crc32", "b2f_md5", "b2f_sha1", "b2f_sha256", "b2f_rnd"):
for length in (7, 70, 700, 7000):
t = timeit('b2f(b"%s")'%(b"x"*length),
'from __main__ import %s as b2f' % func)
print("%.2f"%t, end="\t")
print()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With