Looking for a good 64 bit hash for file paths in UTF16

I have a Unicode / UTF-16 encoded path. The path delimiter is U+005C '\'. The paths are null-terminated, root-relative Windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys".

I want to hash this path into a 64-bit unsigned integer. It does not need to be "cryptographically sound". The hash should be case insensitive, but it must handle non-ASCII letters. Obviously, the hash should also scatter well.

Here are some ideas that I had thought of:

A) Using the Windows file identifier as a "hash". In my case I do want the hash to change if the file gets moved, and the file identifier does not, so this is not an option.

B) Just use a regular string hash: hash = prime * hash + codepoint for the whole string.
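For illustration, option B could be sketched in Python roughly as follows. The function name `codepoint_hash`, the default multiplier (the 64-bit FNV prime, picked only as an example), and the use of `str.upper()` for case folding are all assumptions, not part of the question. Note that Python's upper-casing is not guaranteed to match the table Windows uses for file names.

```python
MASK64 = (1 << 64) - 1  # keep results in the unsigned 64-bit range


def codepoint_hash(path: str, prime: int = 1099511628211) -> int:
    """Hypothetical sketch of option B: hash = prime * hash + codepoint."""
    h = 0
    # Assumption: str.upper() stands in for the file system's case folding.
    for ch in path.upper():
        h = (prime * h + ord(ch)) & MASK64  # emulate 64-bit wraparound
    return h
```

Because the path is upper-cased before hashing, `codepoint_hash("\\Windows")` and `codepoint_hash("\\WINDOWS")` give the same value.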

I have the feeling that the fact that the path consists of "segments" (folder names and the final file name) can be leveraged.

To sum up the needs:

1) 64-bit hash
2) good distribution / few collisions for file system paths.
3) efficient
4) does not need to be secure
5) case insensitive

Dominik Weber asked Sep 15 '10 20:09


1 Answer

I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:

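ui64 res = 10000019;              // arbitrary prime seed
for(i = 0; i < len; i += 2)       // len = number of UTF-16 code units
{
  // pack two upper-cased code units into one 32-bit value;
  // widen first so the multiplication cannot overflow 16 bits
  ui64 merge = (ui64)ucase(path[i]) * 65536 + ucase(path[i + 1]);
  res = res * 8191 + merge;       // unchecked arithmetic, wraps modulo 2^64
}
return res;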

I'm assuming that reading path[i + 1] is safe: if len is odd, the last iteration reads the terminating U+0000, which is still inside the buffer.

I wouldn't try to exploit the gaps in the set of values that can occur (those in UTF-16 itself, those left by lower-case and title-case characters, and those left by characters that are invalid in paths), because they are not distributed in a way that can be exploited quickly. Subtracting 32 from each character (all characters below U+0020 are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing much either.
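As a rough, runnable counterpart to the pseudocode, here is a minimal Python sketch. The name `path_hash` is hypothetical, and `str.upper()` is only an assumed stand-in for `ucase`; it may differ from the upper-casing Windows applies to file names and can change the string length for a few characters. The explicit terminator is appended so that the odd-length case behaves as described above.

```python
MASK64 = (1 << 64) - 1  # emulate unchecked unsigned 64-bit arithmetic


def path_hash(path: str) -> int:
    """Hash the UTF-16 code units of path two at a time, ignoring case."""
    # Assumption: str.upper() approximates ucase() from the pseudocode.
    # "surrogatepass" lets unpaired surrogates through unchanged.
    data = path.upper().encode("utf-16-le", errors="surrogatepass")
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2)]
    length = len(units)
    units.append(0)  # the terminating U+0000, read when length is odd

    res = 10000019
    for i in range(0, length, 2):
        merge = units[i] * 65536 + units[i + 1]  # two code units per step
        res = (res * 8191 + merge) & MASK64
    return res
```

For example, `path_hash("\\windows\\system32\\drivers\\myDriver32.sys")` and the same path in upper case produce the same 64-bit value, while swapping two segments changes it.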

Jon Hanna answered Oct 26 '22 17:10