Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to generate a unique hash code for an object, based on its contents?

Tags:

I need to generate a unique hash code for an object, based on its contents, e.g. DateTime(2011,06,04) should equal DateTime(2011,06,04).

  • I cannot use .GetHashCode() because it might generate the same hash code for objects with different contents.
  • I cannot use .GetID from ObjectIDGenerator as it generates a different hash code for objects with the same contents.
  • If the object contains other sub-objects, it needs to recursively check these.
  • It needs to work on collections.

The reason I need to write this? I'm writing a caching layer using PostSharp.

Update

I think I may have been asking the wrong question. As Jon Skeet pointed out, to be on the safe side, I need as many unique combinations in the cache key as there are combinations of potential data in the object. Therefore, the best solution might be to build up a long string that encodes the public properties for the object, using reflection. The objects are not too large so this is very quick and efficient:

  • Its efficient to construct the cache key (just convert the public properties of the object into a big string).
  • Its efficient to check for a cache hit (compare two strings).
like image 415
Contango Avatar asked Apr 06 '11 16:04

Contango


2 Answers

If you need to create a unique hash code, then you're basically talking about a number which can represent as many states as your type can have. For DateTime than means taking the Ticks value and the DateTimeKind, I believe.

You may be able to get away with assuming that the top two bits of the Ticks property are going to be zero, and using those to store the kind. That means you're okay up until the year 7307 as far as I can tell:

private static ulong Hash(DateTime when)
{
    ulong kind = (ulong) (int) when.Kind;
    return (kind << 62) | (ulong) when.Ticks;
}
like image 29
Jon Skeet Avatar answered Sep 22 '22 14:09

Jon Skeet


From a comment:

I'd like something like a GUID based on the objects contents. I don't mind if there's the occasional duplicate every 10 trillion trillion trillion years or so

That seems like an unusual requirement but since that's your requirement, let's do the math.

Let's suppose you make a billion unique objects a year -- thirty per second -- for 10 trillion trillion trillion years. That's 1049 unique objects you're creating. Working out the math is quite easy; the probability of at least one hash collision in that time is above one in 1018 when the bit size of the hash is less than 384.

Therefore you'll need at least a 384 bit hash code to have the level of uniqueness that you require. That's a convenient size, being 12 int32s. If you're going to be making more than 30 objects a second or want the probability to be less than one in 1018 then more bits will be necessary.

Why do you have such stringent requirements?

Here's what I would do if I had your stated requirements. The first problem is to convert every possible datum into a self-describing sequence of bits. If you have a serialization format already, use that. If not, invent one that can serialize all possible objects that you are interested in hashing.

Then, to hash the object, serialize it into a byte array and then run the byte array through the SHA-384 or SHA-512 hashing algorithm. That will produce a professional-crypto-grade 384 or 512 bit hash that is believed to be unique even in the face of attackers trying to force collisions. That many bits should be more than enough to ensure low probability of collision in your ten trillion trillion trillion year timeframe.

like image 55
Eric Lippert Avatar answered Sep 22 '22 14:09

Eric Lippert