I have sorted lists of filenames concatenated to strings and want to identify each such string by a unique checksum.
The size of these strings is a minimum of 100 bytes, a maximum of 4000 bytes, and an average of 1000 bytes. The total number of strings could be anything but more likely be in the range of ca. 10000.
Is CRC-32 suited for this purpose?
E.g. I need each of the following strings to have a different fixed-length (, preferably short,) checksum:
"/some/path/to/something/some/other/path"
"/some/path/to/something/another/path"
"/some/path"
...
# these strings can get __very__ long (very long strings are the norm)
Is the uniqueness of CRC-32 hashes increased by input length?
Is there a better choice of checksum for this purpose?
No.
Unless your filenames were all four characters or less, there is no assurance that the CRCs will be unique. With 10,000 names, the probability of at least two of them having the same CRC is about 1%.
This would be true for any 32-bit hash value.
The best way to assign a unique code to each name is to simply start a counter at zero for the first name, and increment for each name, assigning the counter as the code for that name. However that won't help you compute the code given just the name.
You can use a hash, such as a CRC or some other hash, but you will need to deal with collisions. There are several common approaches in the literature. You would keep a list of hashes with names assigned, and if you have a collision you could just increment the hash until you find one not used and assign that one. Then when looking up a name, you start at the computed hash and do a linear search for the name until you find it or an unused slot.
As for the hash, I would recommend XXH64. It is a very fast 64-bit hash. You do not need a cryptographic hash for this application, which would be unnecessarily slow.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With