Looking for a good 64 bit hash for file paths in UTF16

Tags: path, hash, utf-16, hash-collision, collision

I have a Unicode / UTF-16 encoded path. The path delimiter is U+005C '\'. The paths are null-terminated, root-relative Windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys".

I want to hash this path into a 64-bit unsigned integer. It does not need to be "cryptographically sound". The hash should be case-insensitive, but must be able to handle non-ASCII letters. Obviously, the hash should also scatter well.

There are some ideas that I had thought of:

A) Using the Windows file identifier as a "hash". In my case I do want the hash to change if the file gets moved, so this is not an option.

B) Just use a regular string hash: hash = prime * hash + codepoint for the whole string.

I do have the feeling that the fact that the path consists of "segments" (folder names and the final file name) can be leveraged.

To sum up the needs:

1) 64-bit hash
2) good distribution / few collisions for file system paths
3) efficient
4) does not need to be secure
5) case-insensitive

asked Sep 15 '10 20:09

Dominik Weber

1 Answer

I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:

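ui64 res = 10000019;            // seed value
for(i = 0; i < len; i += 2)     // len = number of UTF-16 code units, excluding the terminator
{
  // Pack two upper-cased code units into one 32-bit value; widen first so the multiply cannot overflow.
  ui64 merge = (ui64)ucase(path[i]) * 65536 + ucase(path[i + 1]);
  res = res * 8191 + merge;     // unchecked arithmetic: wraps modulo 2^64
}
return res;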

I'm assuming that reading path[i + 1] is safe: if len is odd, the final iteration reads the terminating U+0000.
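For anyone who wants to try the pseudocode out, here is a minimal Python sketch of it. The names path_hash, ucase and utf16_units are illustrative only. ucase here relies on Python's str.upper() applied one code unit at a time, which is an assumption: it may not match the case mapping Windows itself applies to file names, and surrogate halves are left untouched. Python integers do not overflow, so the wraparound is emulated with a mask.

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wraparound


def ucase(unit: int) -> int:
    """Upper-case one UTF-16 code unit (an approximation, see the note above)."""
    if 0xD800 <= unit <= 0xDFFF:      # surrogate halves are passed through
        return unit
    upper = chr(unit).upper()
    # Keep the unit unchanged when the mapping is not a single BMP character
    # (e.g. 'ß' -> 'SS'), so the number of code units never changes.
    if len(upper) == 1 and ord(upper) <= 0xFFFF:
        return ord(upper)
    return unit


def utf16_units(path: str) -> list:
    """Return the UTF-16 code units of path, without a terminator."""
    raw = path.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def path_hash(path: str) -> int:
    units = utf16_units(path)
    n = len(units)                    # "len" in the pseudocode
    units.append(0)                   # the terminating U+0000, read when n is odd
    res = 10000019
    for i in range(0, n, 2):
        merge = ucase(units[i]) * 65536 + ucase(units[i + 1])
        res = (res * 8191 + merge) & MASK64
    return res


lower = path_hash(r"\windows\system32\drivers\myDriver32.sys")
upper = path_hash(r"\WINDOWS\SYSTEM32\DRIVERS\MYDRIVER32.SYS")
print(lower == upper)  # True: the two spellings hash identically
```

Processing two code units per step halves the number of multiplications compared with one step per code unit, which is the main design choice in the pseudocode.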

I wouldn't try to exploit the gaps in the value range (those caused by the gaps in UTF-16, by lower-case and title-case characters, and by characters that are invalid in paths), because they are not distributed in a way that can be exploited cheaply. Subtracting 32 from each code unit (all characters below U+0020 are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing much either.
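One way to see why the offset of 32 buys so little: in a multiply-and-add hash of this shape, subtracting a constant from every code unit shifts the hashes of all equal-length paths by the same amount modulo 2^64. Two equal-length paths therefore collide with the offset exactly when they collide without it. The sketch below checks this numerically. The names pair_hash and units_of are hypothetical, the input is assumed to be already upper-cased and BMP-only, and the padding unit is offset like any other unit to keep the arithmetic uniform.

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wraparound


def pair_hash(units, offset=0):
    """Hash code units two at a time, subtracting `offset` from each unit.

    `units` holds already upper-cased UTF-16 code units; a single 0 is
    appended as padding when the count is odd.
    """
    padded = list(units) + [0] * (len(units) % 2)
    res = 10000019
    for i in range(0, len(padded), 2):
        merge = (padded[i] - offset) * 65536 + (padded[i + 1] - offset)
        res = (res * 8191 + merge) & MASK64
    return res


def units_of(text):
    # Assumes BMP-only text, so ord() of each character is its code unit.
    return [ord(ch) for ch in text]


a = units_of(r"\WINDOWS\SYSTEM32\DRIVERS\A.SYS")
b = units_of(r"\WINDOWS\SYSTEM32\DRIVERS\B.SYS")

# Both paths have the same length, so the offset moves both hashes
# by the same amount and their difference is unchanged.
shift_a = (pair_hash(a) - pair_hash(a, 32)) & MASK64
shift_b = (pair_hash(b) - pair_hash(b, 32)) & MASK64
print(shift_a == shift_b)  # True
```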

answered Oct 26 '22 17:10

Jon Hanna
