I have been wondering how to make an efficient comparison of two collection types (lists, sets, maps, etc.). Note that it is structural equality I am after, not reference-based equality.
Usually one has to iterate through all elements of the collections and compare them pairwise, at a cost of O(1) per comparison, which yields O(n) total comparison time.
This can hurt when using a hash table of lists, where a collision check becomes fairly expensive, or in design-by-contract scenarios (e.g. comparing an old collection with a new one).
I have thought of ways to get a faster solution, but they all seem probabilistic/non-deterministic. The idea is to use some kind of hash over all the elements that can be stored and compared. A good hash algorithm should provide enough entropy that the probability of a collision is small.
This hash-based comparison could be strengthened by adding some kind of constant-time comparison of the list head (say, the first 10 elements). In theory, two lists that match on their first elements and on a good hash should be equal with high probability.
Is it possible to create such a constant-time comparison (both generalized and specialized for some type like integers), and can it be achieved through a hash-based technique?
To clarify the question: I do not need a perfect equality check, just a quick "pre-equality" check to speed up the real equality check that follows. While many hash-code implementations are useful for set comparison, I am also interested in list (ordered) comparison.
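To make the idea concrete, here is a rough sketch of what such a pre-check might look like (the ProbablyEqual name, the headLength parameter and the externally maintained hashes are just illustrative assumptions, not a finished design):

using System;
using System.Collections.Generic;

static class PreEquality
{
    // Cheap "pre-equality" check: compare counts, previously stored hashes and the
    // first few elements. Any mismatch proves the lists differ; a match only means
    // "probably equal" and must be confirmed by a full element-by-element comparison.
    public static bool ProbablyEqual<T>(
        IReadOnlyList<T> a, int hashOfA,
        IReadOnlyList<T> b, int hashOfB,
        int headLength = 10)
    {
        if (a.Count != b.Count)
            return false;
        if (hashOfA != hashOfB)
            return false;
        int head = Math.Min(headLength, a.Count);
        for (int i = 0; i < head; i++)
            if (!EqualityComparer<T>.Default.Equals(a[i], b[i]))
                return false;
        return true;
    }
}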
I took a couple of minutes to write such a collection class in C#; source below. I used the generic System.Collections.ObjectModel.Collection<T> because it's easy to override its functionality.
I haven't tested it at all, but it should be a solid start IMHO. Note that UpdateHash takes indices into account (making the hash function slightly better for ordered comparison), while an analogous HashedSet<T> would skip this part.
Also, because the XOR operator is reversible, recalculating the hash when adding or removing at the end of the list takes O(1). Inserting or removing in the middle shifts the indices of the items that follow, so the hash has to be recomputed in O(n) in that case; the same would happen if a better hash were needed, so I recommend profiling and then deciding what's best.
using System;
using System.Collections.ObjectModel;

public class HashedList<T> : Collection<T>, IEquatable<HashedList<T>>
{
    private int _hash;

    // XOR-folds an (index, item) pair into the running hash. Since XOR is its
    // own inverse, applying the same call again removes that contribution.
    private void UpdateHash(int index, T item)
    {
        _hash ^= index;
        if (item != null)
            _hash ^= item.GetHashCode();
    }

    // Rebuilds the hash from scratch; needed whenever an insertion or removal
    // shifts the indices of the items that follow it.
    private void RecomputeHash()
    {
        _hash = 0;
        for (int i = 0; i < Count; i++)
            UpdateHash(i, this[i]);
    }

    #region Overridden collection methods

    protected override void InsertItem(int index, T item)
    {
        base.InsertItem(index, item);
        if (index == Count - 1)
            UpdateHash(index, item);  // appended at the tail: O(1) update
        else
            RecomputeHash();          // later indices shifted: O(n) rebuild
    }

    protected override void RemoveItem(int index)
    {
        bool removedTail = index == Count - 1;
        T removed = this[index];
        base.RemoveItem(index);
        if (removedTail)
            UpdateHash(index, removed);  // XOR cancels the old contribution
        else
            RecomputeHash();
    }

    protected override void ClearItems()
    {
        _hash = 0;
        base.ClearItems();
    }

    protected override void SetItem(int index, T item)
    {
        UpdateHash(index, this[index]);  // remove the old item's contribution
        UpdateHash(index, item);         // add the new one
        base.SetItem(index, item);
    }

    #endregion

    #region Value equality

    public bool Equals(HashedList<T> other)
    {
        if (other == null)
            return false;
        if (object.ReferenceEquals(this, other))
            return true;
        if (other.Count != this.Count)
            return false;
        if (other._hash != this._hash)
            return false;  // cheap rejection before the O(n) element walk
        return CompareElements(other);
    }

    private bool CompareElements(HashedList<T> other)
    {
        for (int i = 0; i < this.Count; i++)
        {
            if (this[i] == null)
            {
                if (other[i] != null)
                    return false;
                continue;  // both null: equal at this position
            }
            if (!this[i].Equals(other[i]))
                return false;
        }
        return true;
    }

    public override bool Equals(object obj)
    {
        var hashed = obj as HashedList<T>;
        if (hashed != null)
            return Equals(hashed);
        return base.Equals(obj);
    }

    public override int GetHashCode()
    {
        return _hash;
    }

    #endregion
}
You could also argue that object.Equals should return true when any IList<T> implementation with the same elements is passed, but since the two objects' hash codes would then differ, that would break the consistency between Equals and GetHashCode. The above is the recommended way to implement object.Equals, IIRC.
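To show how the pre-check kicks in, a small (equally untested) usage sketch might look like this:

var a = new HashedList<int> { 1, 2, 3 };
var b = new HashedList<int> { 1, 2, 3 };
Console.WriteLine(a.Equals(b)); // True: counts and hashes match, then elements are compared
b[2] = 4;
Console.WriteLine(a.Equals(b)); // False: most likely rejected by the hash check before any element walk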
Use hash-based comparison: compare Hash(SetA) with Hash(SetB).
PS: You need to sort the elements of the sets (or apply some other deterministic ordering) before calculating an order-dependent hash. The hashes may match while the collections do not (a hash collision), but the chances of that happening are pretty low.
PPS: I am assuming that the collections are static (or almost static). In that case you can pre-compute the hashes when the collections are created, so each comparison is O(1). Otherwise, as Groo mentioned, use XOR-based hashing, which is pretty efficient to keep up to date.
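For illustration, an order-insensitive set hash (so no sorting is needed) could look roughly like this; SetHash is just an illustrative name, and a stronger mixing function than plain XOR may be preferable in practice:

using System.Collections.Generic;

static class SetHashing
{
    // XOR-folding element hashes ignores element order, so two sets with the
    // same elements get the same hash regardless of iteration order.
    // Equal hashes still require a full comparison because of collisions.
    public static int SetHash<T>(IEnumerable<T> items)
    {
        int hash = 0;
        foreach (var item in items)
            hash ^= item == null ? 0 : item.GetHashCode();
        return hash;
    }
}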
Follow-up: using information theory, one can show that if X and Y can each take 2^n distinct values, an exact equality check needs at least O(n) bit comparisons in the worst case; there is no getting around that. What hashing buys you is a comparison that is cheap in the common case where the collections differ.
One could use Bloom filters for this task (for sets): each set also has a Bloom filter attached to it.
If the two filters are identical, the structures are probably identical.
If the two filters are not identical, the structures are definitely different from each other.
The upside:
No false negatives: if the filters are different, the structures are different.
The downside:
You might have false positives, so you need an extra check (a full traversal) to make sure that the two structures are indeed identical.
Note that the false positive rate is a function of the Bloom filter's size: the bigger it is, the fewer false positives you get.
Also note: since Bloom filters are essentially bit sets, comparing two Bloom filters can be implemented very efficiently.
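A rough sketch of what such a filter could look like for sets of objects (the class shape, the mixing constants and the SameBitsAs name are just illustrative; a real implementation would size the filter from the expected element count and desired false positive rate):

using System;
using System.Collections.Generic;

class BloomFilter
{
    private readonly ulong[] _bits;
    private readonly int _hashCount;

    public BloomFilter(int bitCount, int hashCount)
    {
        _bits = new ulong[(bitCount + 63) / 64];
        _hashCount = hashCount;
    }

    public void Add(object item)
    {
        foreach (int bit in BitsFor(item))
            _bits[bit / 64] |= 1UL << (bit % 64);
    }

    public bool MightContain(object item)
    {
        foreach (int bit in BitsFor(item))
            if ((_bits[bit / 64] & (1UL << (bit % 64))) == 0)
                return false;  // a missing bit means the item was never added
        return true;           // all bits set: present, or a false positive
    }

    // Two filters built with the same parameters are compared word by word:
    // different bit patterns mean the underlying sets are definitely different.
    public bool SameBitsAs(BloomFilter other)
    {
        if (other._bits.Length != _bits.Length)
            return false;
        for (int i = 0; i < _bits.Length; i++)
            if (_bits[i] != other._bits[i])
                return false;
        return true;
    }

    // Simulates k hash functions by repeatedly re-mixing one hash code.
    private IEnumerable<int> BitsFor(object item)
    {
        int totalBits = _bits.Length * 64;
        uint h = (uint)(item?.GetHashCode() ?? 0);
        for (int i = 0; i < _hashCount; i++)
        {
            h = h * 2654435761u + 0x9E3779B9u;
            yield return (int)(h % (uint)totalBits);
        }
    }
}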