
Very low collision non-cryptographic hashing function

I'm writing an application that uses hashing to speed up file comparisons. Basically, I pre-hash file A, and then the app runs and matches files in a folder against the previously hashed files. My criteria for a hash function are as follows:

  • It should be fast enough that disk I/O is the limiting factor. I'm currently using SHA-256, which works just fine but is way too heavy and makes my application CPU-bound.
  • Cryptography/security doesn't matter in this case. The user is inputting both files, so if they craft a hash collision intentionally, that's on them.
  • Hash collisions should be avoided at almost all costs. I can compare files based on their size and their hash, but if both of those match, the files are assumed to be equal. I know it's impossible to guarantee this with any hash, since a hash compresses the data, but something with the same sort of uniqueness guarantees as SHA-256 would be nice.
  • File sizes range from 10 bytes to 2 GB.
  • A streaming algorithm would be nice, as I try to keep the memory usage of the application low. In other words, I don't want to load the entire file into memory to hash it.
  • Hash size doesn't matter. If I got all of the above with 1024-bit hashes, I'd be completely okay with that.

So what's a good algorithm to use here? I'm using C#, but I'm sure most algorithms are available on any platform. Like I said, I'm using SHA-256, but I'm sure there's something better.
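
For context, the size-then-hash comparison described above might look roughly like the following sketch. It is written in Python rather than C# purely for brevity. SHA-256 from the standard library stands in for whatever faster function ends up being chosen. The names file_fingerprint and probably_same_file and the 1 MiB chunk size are assumptions made for illustration, not part of the actual application.

```python
import hashlib
import os

CHUNK_SIZE = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def file_fingerprint(path, chunk_size=CHUNK_SIZE):
    """Return (size_in_bytes, hex_digest) without loading the whole file."""
    hasher = hashlib.sha256()  # placeholder: swap in a faster hash here
    size = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            size += len(chunk)
            hasher.update(chunk)  # streaming update, memory stays bounded
    return size, hasher.hexdigest()


def probably_same_file(path, known_fingerprint):
    """Compare a file against a previously stored (size, digest) pair."""
    known_size, _known_digest = known_fingerprint
    # Cheap check first: a size mismatch rules the file out without hashing.
    if os.path.getsize(path) != known_size:
        return False
    return file_fingerprint(path) == known_fingerprint
```

Memory use is bounded by the chunk size regardless of file size, so a 2 GB file is handled the same way as a 10-byte one.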

Timothy Baldridge asked Aug 12 '19 03:08




1 Answer

Yann Collet's xxHash may be a good choice (Home page, GitHub)

xxHash is an extremely fast non-cryptographic hash algorithm, working at speeds close to RAM limits. It is proposed in two flavors, 32 and 64 bits.

At least 4 C# implementations are available (see home page).

I had excellent results with it in the past.

The hash size is 32 or 64 bits, but XXH3 is in the making:

XXH3 features a wide internal state of 512 bits, which makes it suitable to generate a hash of up to 256 bit. For the time being, only 64-bit and 128-bit variants are exposed, but a similar recipe can be used for a 256-bit variant if there is any need for it one day. All variant feature same speed, since only the finalization stage is different.

In general, the longer the hash, the slower its calculation. A 64-bit hash is good enough for most practical purposes.
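
To put a rough number on "good enough", the usual birthday-bound approximation can be evaluated directly. The sketch below is Python, and it assumes an ideal hash whose outputs are uniformly distributed, which a real non-cryptographic hash only approximates. The function name collision_probability and the figure of one million files are illustrative assumptions.

```python
import math


def collision_probability(num_files, hash_bits):
    """Approximate chance that at least two of num_files distinct inputs
    share a hash value, assuming an ideal, uniformly distributed hash."""
    pairs = num_files * (num_files - 1) // 2  # number of distinct pairs
    # Birthday approximation: p = 1 - exp(-pairs / 2^bits).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-(pairs / (1 << hash_bits)))


if __name__ == "__main__":
    for bits in (32, 64, 128):
        print(bits, collision_probability(1_000_000, bits))
```

Under these assumptions, a million distinct files almost certainly contain a colliding pair at 32 bits, while at 64 bits the chance is on the order of 1 in 10^8, and at 128 bits it is negligible. Comparing file sizes first lowers the practical risk further, since only same-sized files can be confused.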

You can generate longer hashes by combining two hash functions (e.g. 128-bit XXH3 and 128-bit MurmurHash3).
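
A minimal sketch of that combining pattern, in Python: the same chunks are fed to two different hash functions in a single pass, and the two results are concatenated. Python's standard library ships neither XXH3 nor MurmurHash3, so zlib.crc32 and zlib.adler32 are used here purely as stand-ins, giving a 64-bit result from two 32-bit functions. The name combined_hash64 is an assumption for illustration.

```python
import zlib


def combined_hash64(chunks):
    """Feed the same byte chunks to two different 32-bit hashes and
    concatenate the results into a single 64-bit integer."""
    crc = 0    # documented starting value for zlib.crc32
    adler = 1  # documented starting value for zlib.adler32
    for chunk in chunks:
        # Both functions accept a running value, so data can be streamed.
        crc = zlib.crc32(chunk, crc)
        adler = zlib.adler32(chunk, adler)
    return (crc << 32) | adler  # high 32 bits: CRC-32, low 32 bits: Adler-32


if __name__ == "__main__":
    digest = combined_hash64([b"hello ", b"world"])
    print(f"{digest:016x}")
```

The combined value is only as strong as the two functions are independent of each other, and Adler-32 in particular is weak on short inputs, so this shows the mechanics rather than a recommended pairing. With real 128-bit functions the same loop would yield a 256-bit result.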

Lior Kogan answered Oct 21 '22 08:10
