Creating a checksum on an object graph

This question is related to this one but I think should be asked separately.

I have a complex graph of object instances. Now I would like to create a checksum on this object graph directly in memory to detect whether changes have been made to it since the last time the checksum was saved with the object graph. The checksum calculation should be quick and should not consume too much memory.

As I understand it now, the best solution would probably be to generate a cryptographic hash of a binary serialized form of the object graph (correct me if I am wrong). But that comes with a few questions:

  1. How should I serialize the object? It must be fast and not consume too much memory. Also it must reliably always be serialized the same way. If I use the .NET default serialization can I really be sure that the created binary stream is always the same if the actual data is the same? I doubt it.
  2. So what would be an alternative way to serialize that doesn't take too long to implement?

Update:

What do you think about this approach:

  1. navigate through the graph and for each object in the graph create a standard int hashcode using this algorithm (but exclude reference type members representing nodes in the graph). Add each hashcode to an integer list
  2. convert the integer list to a byte array
  3. create a hash on the byte array using MD5, CRC or similar

The GetHashCode algorithm mentioned should quickly calculate a hashcode that is fairly collision-safe for a single object, taking only its primitive members into account. Based on this, the byte array should also be a fairly collision-safe representation of the object graph, and so should the MD5/CRC hash of it.
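A minimal sketch of the three steps above, assuming a node whose primitive members are an int and a string (the node shape and `GetNodeHashCode` are illustrative assumptions, not the actual object model):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static class GraphChecksum
{
    // Step 1: combine only the primitive members into an int hashcode
    // (classic hash-combining pattern; graph-node references are excluded).
    static int GetNodeHashCode(int id, string name)
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + id;
            hash = hash * 23 + (name?.GetHashCode() ?? 0);
            return hash;
        }
    }

    // Steps 2 and 3: collect the per-node hashes as bytes, then hash once.
    public static byte[] Checksum(IEnumerable<(int Id, string Name)> nodes)
    {
        var bytes = new List<byte>();
        foreach (var n in nodes)
            bytes.AddRange(BitConverter.GetBytes(GetNodeHashCode(n.Id, n.Name)));

        using (var md5 = MD5.Create())
            return md5.ComputeHash(bytes.ToArray());
    }
}
```

Note that `string.GetHashCode` is stable within one process but not across processes on modern .NET, so a checksum built this way should not be persisted across runs.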

bitbonk asked Mar 18 '11 17:03

2 Answers

Instead of binary serialization you could use protobuf-net (http://code.google.com/p/protobuf-net/) and then calculate a crypto hash of the result. protobuf is said to be more compact than binary serialization (see for example http://code.google.com/p/protobuf-net/wiki/Performance ).

I'll add that, considering this, you don't really need to serialize: it might be better to use Reflection and "navigate" through the objects, calculating your hash (in the same way the various serializers "traverse" your object). See for example Using reflection in C# to get properties of a nested object

After much thought, and hearing what @Jon said, I can tell you that my "secondary" idea (using Reflection) is VERY VERY VERY difficult, unless you want to spend a week writing an object parser. Yes, it's doable... But what representation would you give the data before calculating the hash? To be clear:

two strings
"A"
"B"

clearly ("A", "B") != ("AB", ""), but MD5("A" + "B") == MD5("AB" + ""): naive concatenation erases the boundary between fields. Probably the best fix is to prepend the length of each string (i.e. use Pascal/BSTR notation)
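A quick illustration of that pitfall, with a hypothetical length-prefix fix (the `Prefixed` helper and its `length:value` encoding are assumptions for the example, not a standard):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class FieldEncoding
{
    public static byte[] Md5(string s)
    {
        using (var md5 = MD5.Create())
            return md5.ComputeHash(Encoding.UTF8.GetBytes(s));
    }

    // Naive concatenation loses the field boundary:
    // "A" + "B" and "AB" + "" feed identical bytes to the hash.
    //
    // Length-prefixed encoding (Pascal/BSTR style) keeps the boundary:
    // Prefixed("A", "B") -> "1:A1:B", Prefixed("AB", "") -> "2:AB0:".
    public static string Prefixed(params string[] fields)
    {
        var sb = new StringBuilder();
        foreach (var f in fields)
            sb.Append(f.Length).Append(':').Append(f);
        return sb.ToString();
    }
}
```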

And null values? What "serialized" value do they have? Another tough question. Clearly, if you serialize a string as length+string (to solve the previous problem), you could serialize null simply as "null" (with no length)... And the objects? Would you prepend an object type id? That would surely be better. Otherwise variable-length objects could make the same mess as strings.

Using BinaryFormatter (or probably even protobuf-net) you don't actually have to store the serialized object anywhere, because both support streaming... An example:

// Required usings (not shown in the original):
// using System; using System.IO; using System.Security.Cryptography;
// using System.Runtime.Serialization.Formatters.Binary;
public class Hasher : Stream
{
    protected readonly HashAlgorithm HashAlgorithm;

    protected Hasher(HashAlgorithm hash)
    {
        HashAlgorithm = hash;
    }

    public static byte[] GetHash(object obj, HashAlgorithm hash)
    {
        var hasher = new Hasher(hash);

        if (obj != null)
        {
            var bf = new BinaryFormatter();
            bf.Serialize(hasher, obj);
        }
        else
        {
            hasher.Flush();
        }

        return hasher.HashAlgorithm.Hash;
    }

    public override bool CanRead
    {
        get { throw new NotImplementedException(); }
    }

    public override bool CanSeek
    {
        get { throw new NotImplementedException(); }
    }

    public override bool CanWrite
    {
        get { return true; }
    }

    public override void Flush()
    {
        HashAlgorithm.TransformFinalBlock(new byte[0], 0, 0);
    }

    public override long Length
    {
        get { throw new NotImplementedException(); }
    }

    public override long Position
    {
        get
        {
            throw new NotImplementedException();
        }
        set
        {
            throw new NotImplementedException();
        }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        throw new NotImplementedException();
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotImplementedException();
    }

    public override void SetLength(long value)
    {
        throw new NotImplementedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        HashAlgorithm.TransformBlock(buffer, offset, count, buffer, offset);
    }
}

static void Main(string[] args)
{
    var list = new List<int>(100000000);

    for (int i = 0; i < list.Capacity; i++)
    {
        list.Add(0);
    }

    Stopwatch sw = Stopwatch.StartNew();
    var hash = Hasher.GetHash(list, new MD5CryptoServiceProvider());
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
}

I define a Hasher class that receives the serialization of the object (a piece at a time) and computes the hash in "streaming mode". The memory use is O(1). The time is clearly O(n) (with n the "size" of the serialized object).

If you want to use protobuf instead (but be aware that for complex objects it needs them to be marked with its attributes, or with WCF/DataContract attributes):

public static byte[] GetHash<T>(T obj, HashAlgorithm hash)
{
    var hasher = new Hasher(hash);

    if (obj != null)
    {
        ProtoBuf.Serializer.Serialize(hasher, obj);
        hasher.Flush();
    }
    else
    {
        hasher.Flush();
    }

    return hasher.HashAlgorithm.Hash;
}

The only "big" differences are that protobuf doesn't Flush the stream, so we have to do it ourselves, and that it really requires the root object to be typed, not a plain "object".

Oh... and for your question:

How should I serialize the object? It must be fast and not consume too much memory. Also it must reliably always be serialized the same way. If I use the .NET default serialization can I really be sure that the created binary stream is always the same if the actual data is the same? I doubt it.

List<int> l1 = new List<int>();

byte[] bytes1, bytes2;

using (MemoryStream ms = new MemoryStream())
{
    new BinaryFormatter().Serialize(ms, l1);
    bytes1 = ms.ToArray();
}

l1.Add(0);
l1.RemoveAt(0);

using (MemoryStream ms = new MemoryStream())
{
    new BinaryFormatter().Serialize(ms, l1);
    bytes2 = ms.ToArray();
}

Debug.Assert(bytes1.Length == bytes2.Length);

Let's put it this way: the Debug.Assert will fail. This is because List&lt;T&gt; "saves" some internal state (for example, a version counter). This makes binary serialization very hard to use for comparison. You'd be better off with a "programmable" serializer (like protobuf): you tell it which properties/fields to serialize and it serializes exactly those.
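To make the point concrete, here is a hedged sketch of the "programmable" route: serialize only the data you declare (here, the list's elements behind a length prefix), and the Add/RemoveAt round trip no longer changes the bytes:

```csharp
using System.Collections.Generic;
using System.IO;

static class ElementSerializer
{
    // Serialize only the declared data: a count prefix plus the elements.
    // List<T>'s private version counter never enters the stream.
    public static byte[] SerializeElements(List<int> list)
    {
        using (var ms = new MemoryStream())
        using (var bw = new BinaryWriter(ms))
        {
            bw.Write(list.Count);
            foreach (int i in list)
                bw.Write(i);
            bw.Flush();
            return ms.ToArray();
        }
    }
}
```

With this, the equivalent of the Debug.Assert above holds: an Add followed by a RemoveAt leaves the output byte-for-byte identical.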

So what would be an alternative way to serialize that doesn't take to long to implement?

Proto-buf... or DataContractSerializer (but it's quite slow). As you can imagine, there isn't a silver bullet to data serialization.

xanatos answered Oct 28 '22 23:10


What do you think about this approach:

  • navigate through the graph and for each object in the graph create a standard int hashcode using this algorithm (but exclude reference type members representing nodes in the graph).
  • Add each hashcode to an integer list
  • Convert the integer list to a byte array
  • Create a hash on the byte array using MD5, CRC or similar

This approach is quite close to what I'd consider best, but it could use some polishing.

Hashing

Considering that you would prefer speed over accuracy, and that an int-sized hashcode for each item leaves plenty of room for avoiding collisions, the choice of hashcode algorithm seems right. Excluding reference types that participate in the graph means we're throwing some information away; see below for more on that.

Improving the node hash

The idea of not taking into account other nodes connected to the node we are hashing is correct, but maybe we can do better than simply throwing all that information away? We don't want to take the hashcodes of other nodes into account (they will be hashed themselves as well), but we are throwing away the information provided by the graph edges here: the hashcode for a node with internal data X connected to N other nodes should not be the same as that of a node with data X connected to M other nodes.

If you have a cheap way of taking part of the edge data into account, use it. For example, if the graph is directed, then you can add to the hashcode computed for each node the number of edges going out from it to other nodes.
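For example (a sketch with an assumed minimal `Node` type, not the asker's actual model), mixing the out-degree into the per-node hash:

```csharp
using System.Collections.Generic;

// Minimal illustrative node type.
class Node
{
    public int Data;
    public List<Node> OutEdges = new List<Node>();

    public int HashWithDegree()
    {
        unchecked
        {
            int h = 17;
            h = h * 31 + Data;           // primitive members only
            h = h * 31 + OutEdges.Count; // cheap edge information: the out-degree
            return h;
        }
    }
}
```

Two nodes with identical data but different out-degrees now hash differently.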

Aggregating hashcodes

Creating a list of hashcodes is a middle ground between summing the hashcodes into one long (very fast, and it keeps more information than summing into an int) and producing a list that depends on a total order of the items in the graph. If you expect lots of items in the graph then summing might be more appropriate (I'd try that first and see if it's collision-free enough); if the graph doesn't have many items (say &lt; 1000) then I'd try the total-order approach first. Remember to allocate enough memory for the list (or simply use an array) when creating it; you already know its final length, so that's a free speed increase.
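The two aggregation strategies side by side, assuming the per-node hashcodes have already been computed (a sketch, not a tuned implementation):

```csharp
using System.Collections.Generic;

static class Aggregation
{
    // Order-independent: sum into a long. O(1) memory, immune to node order.
    public static long SumHashes(IEnumerable<int> nodeHashes)
    {
        long sum = 0;
        foreach (int h in nodeHashes)
            sum += h;
        return sum;
    }

    // Order-dependent: collect into a pre-sized list (the final length is
    // known up front, so no reallocation); requires a total order on nodes.
    public static List<int> CollectHashes(IReadOnlyCollection<int> nodeHashes)
    {
        var list = new List<int>(nodeHashes.Count);
        list.AddRange(nodeHashes);
        return list;
    }
}
```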

Producing a fixed-size hash

If you have summed the hashcodes into a primitive, this step is not required at all. Otherwise, hashing the list as a byte[] is what I'd consider best. Since hashing the bytes will take very little time compared to creating the list, you may want to use a larger hash function than MD5 or CRC32 to reduce collisions without a practical performance hit.

Improving the final hash quality

After getting this "final" hash, I'd prepend or append to it the number of items in the hashed graph as a fixed-size hex-encoded string, because:

  • It might help in reducing collisions (how much depends on the nature of the graphs)
  • We already know the number of items in the graph (we just hashed each one of them) so it's an O(1) operation
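As a sketch, the final value could be the hash bytes plus the item count as fixed-width hex (the exact formatting choices here are assumptions):

```csharp
using System;

static class ChecksumFinalizer
{
    // Append the (already known) item count as an 8-digit hex string.
    public static string Finalize(byte[] hashBytes, int itemCount)
        => Convert.ToBase64String(hashBytes) + itemCount.ToString("X8");
}
```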

Defining a total order

If the order in which the items in the graph are processed is not strictly defined, then the door is open for false negatives: two graphs which should hash to the same value do not, because even though they are logically equivalent, the implementation of the hash function chose to process the per-item hashes in a different order. This problem will appear only if you use a list, since addition is commutative and associative, so the "add into a long" approach is immune to it.

To combat that, you need to process the nodes in the graph in a well-defined order. That might be an order that's easy to produce from the data structure of the nodes (e.g. preorder traversal on a tree) and/or from other information (e.g. class names or node types for each node, node ids if they exist, etc.).
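For instance, if each node carries a stable id, sorting on it gives such a total order (the `(Id, Hash)` node shape here is an illustrative assumption):

```csharp
using System.Linq;

static class TotalOrder
{
    // Sort on a stable id so logically equivalent graphs enumerate their
    // nodes -- and therefore their per-node hashes -- in the same sequence.
    public static int[] OrderedHashes((int Id, int Hash)[] nodes)
        => nodes.OrderBy(n => n.Id).Select(n => n.Hash).ToArray();
}
```

Whatever enumeration order the graph happens to produce, the hash sequence comes out the same.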

Since preprocessing the graph to produce a total order is going to take some time, you may want to weigh that against the cost incurred by a false negative result as I mentioned above. Also, if the graphs are large enough then this discussion might be moot because of the node hashcode summation approach being more suited to your needs.

Jon answered Oct 28 '22 23:10