Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hashing a Tree Structure

I've just come across a scenario in my project where it I need to compare different tree objects for equality with already known instances, and have considered that some sort of hashing algorithm that operates on an arbitrary tree would be very useful.

Take for example the following tree:

         O        / \       /   \      O     O     /|\    |    / | \   |   O  O  O  O           / \          /   \         O     O 

Where each O represents a node of the tree, is an arbitrary object, has has an associated hash function. So the problem reduces to: given the hash code of the nodes of tree structure, and a known structure, what is a decent algorithm for computing a (relatively) collision-free hash code for the entire tree?

A few notes on the properties of the hash function:

  • The hash function should depend on the hash code of every node within the tree as well as its position.
  • Reordering the children of a node should distinctly change the resulting hash code.
  • Reflecting any part of the tree should distinctly change the resulting hash code

If it helps, I'm using C# 4.0 here in my project, though I'm primarily looking for a theoretical solution, so pseudo-code, a description, or code in another imperative language would be fine.


UPDATE

Well, here's my own proposed solution. It has been helped much by several of the answers here.

Each node (sub-tree/leaf node) has the following hash function:

public override int GetHashCode() {     int hashCode = unchecked((this.Symbol.GetHashCode() * 31 +         this.Value.GetHashCode()));     for (int i = 0; i < this.Children.Count; i++)         hashCode = unchecked(hashCode * 31 + this.Children[i].GetHashCode());     return hashCode; } 

The nice thing about this method, as I see it, is that hash codes can be cached and only recalculated when the node or one of its descendants changes. (Thanks to vatine and Jason Orendorff for pointing this out).

Anyway, I would be grateful if people could comment on my suggested solution here - if it does the job well, then great, otherwise any possible improvements would be welcome.

like image 441
Noldorin Avatar asked Jan 01 '10 14:01

Noldorin


People also ask

What is the use of hash tree?

Hash trees can be used to verify any kind of data stored, handled and transferred in and between computers. They can help ensure that data blocks received from other peers in a peer-to-peer network are received undamaged and unaltered, and even to check that the other peers do not lie and send fake blocks.

How does hash function useful for Merkle tree structure?

It allows the user to verify whether a transaction can be included in a block or not. Merkle trees are created by repeatedly calculating hashing pairs of nodes until there is only one hash left. This hash is called the Merkle Root, or the Root Hash. The Merkle Trees are constructed in a bottom-up approach.

What is a hashed structure?

What is Hashing in Data Structure? Hashing in the data structure is a technique of mapping a large chunk of data into small tables using a hashing function. It is also known as the message digest function. It is a technique that uniquely identifies a specific item from a collection of similar items.

How is a Merkle tree structured?

A Merkle tree is a hash-based data structure that is a generalization of the hash list. It is a tree structure in which each leaf node is a hash of a block of data, and each non-leaf node is a hash of its children. Typically, Merkle trees have a branching factor of 2, meaning that each node has up to 2 children.


1 Answers

If I were to do this, I'd probably do something like the following:

For each leaf node, compute the concatenation of 0 and the hash of the node data.

For each internal node, compute the concatenation of 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.

This will lead to a cascade up the tree every time you change anything, but that MAY be low-enough of an overhead to be worthwhile. If changes are relatively infrequent compared to the amount of changes, it may even make sense to go for a cryptographically secure hash.

Edit1: There is also the possibility of adding a "hash valid" flag to each node and simply propagate a "false" up the tree (or "hash invalid" and propagate "true") up the tree on a node change. That way, it may be possible to avoid a complete recalculation when the tree hash is needed and possibly avoid multiple hash calculations that are not used, at the risk of slightly less predictable time to get a hash when needed.

Edit3: The hash code suggested by Noldorin in the question looks like it would have a chance of collisions, if the result of GetHashCode can ever be 0. Essentially, there is no way of distinguishing a tree composed of a single node, with "symbol hash" 30 and "value hash" 25 and a two-node tree, where the root has a "symbol hash" of 0 and a "value hash" of 30 and the child node has a total hash of 25. The examples are entirely invented, I don't know what expected hash ranges are so I can only comment on what I see in the presented code.

Using 31 as the multiplicative constant is good, in that it will cause any overflow to happen on a non-bit boundary, although I am thinking that, with sufficient children and possibly adversarial content in the tree, the hash contribution from items hashed early MAY be dominated by later hashed items.

However, if the hash performs decently on expected data, it looks as if it will do the job. It's certainly faster than using a cryptographic hash (as done in the example code listed below).

Edit2: As for specific algorithms and minimum data structure needed, something like the following (Python, translating to any other language should be relatively easy).

 #! /usr/bin/env  python  import Crypto.Hash.SHA  class Node:     def __init__ (self, parent=None, contents="", children=[]):         self.valid = False         self.hash = False         self.contents = contents         self.children = children       def append_child (self, child):         self.children.append(child)          self.invalidate()      def invalidate (self):         self.valid = False         if self.parent:             self.parent.invalidate()      def gethash (self):         if self.valid:             return self.hash          digester = crypto.hash.SHA.new()          digester.update(self.contents)          if self.children:             for child in self.children:                 digester.update(child.gethash())             self.hash = "1"+digester.hexdigest()         else:             self.hash = "0"+digester.hexdigest()          return self.hash      def setcontents (self):         self.valid = False         return self.contents 
like image 127
Vatine Avatar answered Sep 27 '22 18:09

Vatine