Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Checking if a string hash contains a substring hash

Tags:

string

c#

hash

Let's say I have a ton of documents that I somehow hash (e.g. Sha256) and store their hashes. Is there a hashing technique that would allow me checking if string1 is contained within string2, by just looking at their hashes? I want to avoid loading the full text.

To clarify: this is unrelated to sim/min-hashing, looking for near-duplicates or Levenshtein distance. I am looking for a hashing algorithm that would somehow allow me to check for substrings, by just looking at the hashes.

e.g.

var string1 = "bla bla bla cat dog bla bla";
var string2 = "cat dog";
var hash1 = HashAlgo(string1); // <-- magic goes here
var hash2 = HashAlgo(string2);
Assert.IsTrue(string1.Contains(string2));
Assert.IsTrue(hash1.Contains(hash2)); // <--- magic goes here
like image 658
hyankov Avatar asked Mar 01 '26 07:03

hyankov


1 Answers

If you think about it, it wouldn't make sense that this is possible.

First of all, all SHA256 hashes have the exact same length. I've based the answer on SHA256 but as far as I'm aware this applies to any hashing method.

  • Consider a 1000-character document, which youve SHA256-hashed. Its hash is 64 digits long.
  • Consider a 100-character document, which you've SHA256-hashed. Its hash is 64 digits long. The content of this document happens to be the first chapter of the larger document.
  • Consider a second 100-character document, which you've SHA256-hashed. Its hash is 64 digits long. The content of this document happens to be the second chapter of the larger document.

It's impossible for the larger file's hash to contain both of the smaller files' hashes, as that would only be possible if all three hashes were equal to each other.

Secondly, think of how many 100 character substrings I could take from a 1000-character document. It's not just 10 (as in 1000/100 = 10), but rather 900. Denoting the substrings as index boundaries, there are many possibilities:

  • 0 to 100
  • 1 to 101
  • 2 to 102
  • ...
  • 897 to 997
  • 898 to 998
  • 899 to 999

There are 900 options in total. Assuming that your initial document does not repeat itself in any way (so you don't get two equal substrings), this will lead to 900 (presumed) unique hashes.

These 900 unique hashes can't all be a substring of the initial file's hash.

Furthermore, consider that we haven't even thought about substrings of other lengths! Assuming any possible substring length, you can end up with 999,000 different substrings (but of course some of these will have duplicates)

And that's not even thinking about the fact that the original document could be well over 1000 characters long. For any document with n characters, you can expect to find n*(n-1) substrings (with a length between 1 and n), with predominantly unique hash values.

This expansion of possible values only plateaus once you're in the order of magnitude of 1077 (more precisely, 2256), as this is how many unique SHA hashes can possibly exist.
Back of the napkin, that would be a document with 1038 bytes. Once you get to that filesize, all possible substrings (of any length) will have to contain at least one duplicate.

I think you can see why your suggestion is simply not mathematically possible.

I will keep this as a sidenote, but superpermutations are a tangential topic worth looking at to understand how impossible this is. For 7 unique characters, you need a superpermutation of 5907 digits if you want to encompass all possible permutations of the 7 characters. This is the highest N for which we have found (minimal) superpermutations.

For the initial example of 900 unique hashes (= unique permutations of hexedecimal characters) which would all be contained in your "master" hash, the minimum required length of the master hash is simply incalculable. But as an absolute minimum (which you provably cannot go under), your master hash would have to be 963 characters long (if you assume that every single 64-character substring always gives you a unique new hash)

like image 116
Flater Avatar answered Mar 03 '26 21:03

Flater



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!