I have a Java application in which I want to generate long IDs for strings (in order to store those strings in Neo4j). To avoid data duplication, I would like to generate an ID for each string, stored as a long integer, which should be unique for each string. How can I do that?
To derive a hash from a string, you can write your own string-to-hash function, which returns the hash value for a given string. Alternatively, a library named Crypto can be used to generate various types of hashes such as SHA-1, MD5, SHA-256 and many more.
A hash function takes a variable-length sequence of bytes and maps it to a fixed-length sequence. A cryptographic hash function is also one-way: if f is the hash function, computing f(x) is fast and simple, but recovering x from f(x) is computationally infeasible.
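As a rough sketch of the cryptographic route in Java, the standard java.security.MessageDigest class can compute a SHA-256 digest, and the first 8 bytes can be read as a long. The class name Sha256LongId and the method idFor are made up for this illustration, and truncating to 64 bits means collisions are unlikely but still possible.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a 64-bit id from the SHA-256 digest of a string.
final class Sha256LongId {
    static long idFor(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Fixed charset so the result does not depend on the platform default.
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            // Read the first 8 of the 32 digest bytes as a big-endian long.
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling Sha256LongId.idFor("some string") returns the same value on every run and every machine, which matters if the IDs are persisted in a database.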
If you just want a good hash function and cannot wait, djb2 is one of the best string hash functions I know. It has excellent distribution and speed on many different sets of keys and table sizes. You are not likely to do better with one of the "well known" functions such as PJW, K&R[1], etc. Also see tpop pp.
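For reference, djb2 is short enough to write by hand. Here is a minimal Java sketch that accumulates into a long instead of the usual 32-bit value; the class name Djb2 is made up for this example, and widening the accumulator is an adaptation to the long requirement rather than part of the original algorithm.

```java
// Hypothetical helper: djb2 string hash accumulated in a 64-bit long.
final class Djb2 {
    static long hash(String s) {
        long h = 5381L; // djb2's traditional starting value
        for (int i = 0; i < s.length(); i++) {
            // (h << 5) + h is h * 33; overflow simply wraps around.
            h = ((h << 5) + h) + s.charAt(i);
        }
        return h;
    }
}
```

Note that this is a fast non-cryptographic hash: distinct strings can still map to the same value.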
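This code will calculate a pretty good hash. It builds a name-based (MD5) UUID from the string's bytes and keeps the upper 64 bits; the charset is fixed so the result does not depend on the platform default encoding. Four of those 64 bits hold the UUID version, so only 60 bits actually vary:
import java.nio.charset.StandardCharsets;
import java.util.UUID;
String s = "some string";
long hash = UUID.nameUUIDFromBytes(s.getBytes(StandardCharsets.UTF_8)).getMostSignificantBits();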
Why don't you have a look at the hashCode() function of String, and just adapt it to use long values instead?
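A minimal sketch of that idea: String.hashCode() is documented as the polynomial s[0]*31^(n-1) + ... + s[n-1], so the same loop can be run with a long accumulator. The class name LongStringHash is made up for this example.

```java
// Hypothetical helper: the String.hashCode() polynomial with a 64-bit accumulator.
final class LongStringHash {
    static long hash(String s) {
        long h = 0L;
        for (int i = 0; i < s.length(); i++) {
            h = 31L * h + s.charAt(i); // same recurrence as String.hashCode()
        }
        return h;
    }
}
```

Widening the accumulator only helps with longer strings. Short strings that collide under hashCode(), such as "Aa" and "BB", still collide here, because the polynomial never overflows an int for them and produces the same value in both versions.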
Btw., if there were a way to create a unique ID for each String, then you would have found a compression algorithm able to pack every String into 8 bytes (not possible by definition).
A long has 64 bits. A String of length 9 has at least 72 bits (at 8 bits per character; a Java char is 16 bits, so even more). By the pigeonhole principle, you cannot get a unique hash into a long for all strings that are 9 characters long.
If you still want a long hash: you can take two standard [different!] hash functions for String->int, hash1() and hash2(), and calculate: hash(s) = 2^32 * hash1(s) + hash2(s)
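Here is one way the combination might look in Java. The choice of String.hashCode() for hash1 and a 32-bit djb2-style loop for hash2 is an assumption made for this sketch; any two different String->int functions would do. The class name CombinedHash is likewise made up.

```java
// Hypothetical helper: packs two different 32-bit hashes into one 64-bit value.
final class CombinedHash {
    static int hash1(String s) {
        return s.hashCode(); // standard polynomial hash with multiplier 31
    }

    static int hash2(String s) {
        int h = 5381; // djb2-style hash with multiplier 33
        for (int i = 0; i < s.length(); i++) {
            h = h * 33 + s.charAt(i);
        }
        return h;
    }

    static long hash(String s) {
        // hash1 fills the upper 32 bits, hash2 the lower 32 bits.
        // The mask stops a negative hash2 from sign-extending into the upper half.
        return ((long) hash1(s) << 32) | (hash2(s) & 0xFFFFFFFFL);
    }
}
```

The shift and mask are the bitwise form of 2^32 * hash1(s) + hash2(s) with hash2 treated as unsigned. Adding a signed int directly would let a negative hash2 borrow from the upper half.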
There are many answers on this topic already, although I missed the long requirement. Mea culpa. Or, as suggested before, check out the sources.
PS. One more technique is to maintain a dictionary of strings: since you're unlikely to get 2^64 strings any time soon, you can have a perfect mapping. Note, though, that the mapping may well become a major bottleneck.
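A minimal in-memory sketch of the dictionary approach, assuming a single JVM and no persistence. The class name StringIdRegistry is made up for this example, and a real setup would need to store the mapping somewhere durable.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: hands out sequential ids, one per distinct string.
final class StringIdRegistry {
    private final ConcurrentHashMap<String, Long> ids = new ConcurrentHashMap<>();
    private final AtomicLong next = new AtomicLong();

    long idFor(String s) {
        // computeIfAbsent is atomic per key, so a string gets exactly one id
        // even when several threads ask for it at the same time.
        return ids.computeIfAbsent(s, k -> next.getAndIncrement());
    }
}
```

Unlike a hash, this never collides, but every lookup goes through the shared map, which is where the bottleneck mentioned above would show up.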