I have several billion strings in the format word0.word1.word2, and I wish to perform modulo n on those strings so that I can feed each to a database writer for storage. I know I can perform a form a modulo 10 on the first character of the strings like this:
for i in ["a.b","c.d"]:
print ord(i[0]) % 10
This won't divide my strings evenly, though, as word0, word1, and word2 are sorted into alphabetical order, and the first character of the string is very often "a". I could take the last letter of the string, but am not sure if those are normally distributed or not.
My question: Is there a fast way to perform something like "ord" on the entire string? I ultimately plan to run modulo 48 on the integer representations of the strings, and wish for that modular output to be uniformly distributed across all 48 cores. I would be grateful for any help others can offer.
s = "whatever" # have a string
h = hash(s) # obtain its hash
bin = h % 48 # find the bin
Update: The Python's built-in hash
function provides deterministic values only for a single process. If you want to keep this information (directly or indirectly ) in a database you have to use an explicit hash function that doesn't include any random data. (Credit goes to @Alik)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With