Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure with unique elements and fast add and remove

I need a data structure with the following properties:

  • Each element of the structure must be unique.
  • Add: Adds one element to the data structure unless the element already exists.
  • Pop: Removes one element from the data structure and returns the element removed. It's unimportant which element is removed.

No other operations are required for this structure. A naive implementation with a list will require almost O(1) time for Pop and O(N) time for Add (since the entire list must be checked to ensure uniqueness). I am currently using a red-black tree to fulfill the needs of this data structure, but I am wondering if I can use something less complicated to achieve almost the same performance.

I prefer answers in C#, but Java, Javascript, and C++ are also acceptable.

My question is similar to this question, however I have no need to lookup or remove the maximum or minimum value (or indeed any particular kind of value), so I was hoping there would be improvements in that respect. If any of the structures in that question are appropriate here, however, let me know.

So, what data structure allows only unique elements, supports fast add and remove, and is less complicated than a red-black tree?

like image 471
Peter O. Avatar asked Aug 31 '11 18:08

Peter O.


3 Answers

What about the built-in HashSet<T>?

It contains only unique elements. Remove (pop) is O(1) and Add is O(1) unless the internal array must be resized.

like image 159
Meta-Knight Avatar answered Nov 07 '22 11:11

Meta-Knight


As said by Meta-Knight, a HashSet is the fastest data structure to do exactly that. Lookups and removals take constant O(1) time (except in rare cases when your hash sucks and then you require multiple rehashes or you use a bucket hashset). All operations on a hashset take O(1) time, the only drawback is that it requires more memory because the hash is used as an index into an array (or other allocated block of memory). So unless you're REALLY strict on memory then go with HashSet. I'm only explaining the reason why you should go with this approach and you should accept Meta-Knights answer as his was first.

Using hashes is OK because usually you override the HashCode() and Equals() functions. What the HashSet does internally is generate the hash, then if it is equal check for equality (just in case of hash collisions). If they are not it must call a method to do something called rehashing which generates a new hash which is usually at an odd prime offset from the original hash (not sure if .NET does this but other languages do) and repeats the process as necessary.

like image 24
Jesus Ramos Avatar answered Nov 07 '22 09:11

Jesus Ramos


Remove a random element is quite easy from an hashset or a dictionary. Everything is averaged O(1), that in real world means O(1). Example:

public class MyNode
{
    ...
}

public class MyDataStructure
{
    private HashSet<MyNode> nodes = new HashSet<MyNode>();

    /// <summary>
    /// Inserts an element to this data structure. 
    /// If the element already exists, returns false.
    /// Complexity is averaged O(1).
    /// </summary>
    public bool Add(MyNode node)
    {
        return node != null && this.nodes.Add(node);
    }

    /// <summary>
    /// Removes a random element from the data structure.
    /// Returns the element if an element was found.
    /// Returns null if the data structure is empty.
    /// Complexity is averaged O(1).
    /// </summary>
    public MyNode Pop()
    {
        // This loop can execute 1 or 0 times.
        foreach (MyNode node in nodes)
        {
            this.nodes.Remove(node);
            return node;
        }
        return null;
    }
}

Almost everything that can be compared can also be hashed :) in my experience. I would like to know if there is someone that know something that cannot be hashed.

To my experience this applies also to some floating point comparations with tolerance using special techniques.

An hash function for an hash table don't need to be perfect, it just need to be good enough. Also if your data is very complicated usually hash functions are less complicated than red black trees or avl trees. They are useful because they keep things ordered, but you don't need this.

To show how to do a simple hashset i will consider a simple dictionary with integer keys. This implementation is very fast and very good for sparse arrays for examples. I didn't write the code to grow the bucket table, because it is annoying and usually a source of big bugs, but since this is a proof of concept, it should be enough. I didn't write iterator neither.

I wrote it by scratch, there may be bugs.

public class FixedIntDictionary<T>
{
    // Our internal node structure.
    // We use structs instead of objects to not add pressure to the garbage collector.
    // We mantains our own way to manage garbage through the use of a free list.
    private struct Entry
    {
        // The key of the node
        internal int Key;

        // Next index in pEntries array.
        // This field is both used in the free list, if node was removed
        // or in the table, if node was inserted.
        // -1 means null.
        internal int Next;

        // The value of the node.
        internal T Value;
    }

    // The actual hash table. Contains indices to pEntries array.
    // The hash table can be seen as an array of singlt linked list.
    // We store indices to pEntries array instead of objects for performance
    // and to avoid pressure to the garbage collector.
    // An index -1 means null.
    private int[] pBuckets;

    // This array contains the memory for the nodes of the dictionary.
    private Entry[] pEntries;

    // This is the first node of a singly linked list of free nodes.
    // This data structure is called the FreeList and we use it to
    // reuse removed nodes instead of allocating new ones.
    private int pFirstFreeEntry;

    // Contains simply the number of items in this dictionary.
    private int pCount;

    // Contains the number of used entries (both in the dictionary or in the free list) in pEntries array.
    // This field is going only to grow with insertions.
    private int pEntriesCount;

    ///<summary>
    /// Creates a new FixedIntDictionary. 
    /// tableBucketsCount should be a prime number
    /// greater than the number of items that this
    /// dictionary should store.
    /// The performance of this hash table will be very bad
    /// if you don't follow this rule!
    /// </summary>
    public FixedIntDictionary<T>(int tableBucketsCount)
    {
        // Our free list is initially empty.
        this.pFirstFreeEntry = -1;

        // Initializes the entries array with a minimal amount of items.
        this.pEntries = new Entry[8];

        // Allocate buckets and initialize every linked list as empty.
        int[] buckets = new int[capacity];
        for (int i = 0; i < buckets.Length; ++i)
            buckets[i] = -1;

        this.pBuckets = buckets;
    }

    ///<summary>Gets the number of items in this dictionary. Complexity is O(1).</summary>
    public int Count
    {
        get { return this.pCount; }
    }

    ///<summary>
    /// Adds a key value pair to the dictionary.
    /// Complexity is averaged O(1).
    /// Returns false if the key already exists.
    /// </summary>
    public bool Add(int key, T value)
    {
        // The hash table can be seen as an array of linked list.
        // We find the right linked list using hash codes.
        // Since the hash code of an integer is the integer itself, we have a perfect hash.

        // After we get the hash code we need to remove the sign of it.
        // To do that in a fast way we and it with 0x7FFFFFFF, that means, we remove the sign bit.
        // Then we have to do the modulus of the found hash code with the size of our buckets array.

        // For this reason the size of our bucket array should be a prime number,
        // this because the more big is the prime number, the less is the chance to find an
        // hash code that is divisible for that number. This reduces collisions.

        // This implementation will not grow the buckets table when needed, this is the major
        // problem with this implementation.
        // Growing requires a little more code that i don't want to write now
        // (we need a function that finds prime numbers, and it should be fast and we
        // need to rehash everything using the new buckets array).

        int bucketIndex = (key & 0x7FFFFFFF) % this.pBuckets.Length;
        int bucket = this.pBuckets[bucketIndex];

        // Now we iterate in the linked list of nodes.
        // Since this is an hash table we hope these lists are very small.
        // If the number of buckets is good and the hash function is good this will translate usually 
        // in a O(1) operation.

        Entry[] entries = this.pEntries;
        for (int current = entries[bucket]; current != -1; current = entries[current].Next)
        {
            if (entries[current].Key == key)
            {
                // Entry already exists.
                return false;
            }
        }

        // Ok, key not found, we can add the new key and value pair.

        int entry = this.pFirstFreeEntry;
        if (entry != -1)
        {
            // We found a deleted node in the free list.
            // We can use that node without "allocating" another one.
            this.pFirstFreeEntry = entries[entry].Next;
        }
        else
        {
            // Mhhh ok, the free list is empty, we need to allocate a new node.
            // First we try to use an unused node from the array.
            entry = this.pEntriesCount++;
            if (entry >= this.pEntries)
            {
                // Mhhh ok, the entries array is full, we need to make it bigger.
                // Here should go also the code for growing the bucket table, but i'm not writing it here.
                Array.Resize(ref this.pEntries, this.pEntriesCount * 2);
                entries = this.pEntries;
            }
        }

        // Ok now we can add our item.
        // We just overwrite key and value in the struct stored in entries array.

        entries[entry].Key = key;
        entries[entry].Value = value;

        // Now we add the entry in the right linked list of the table.

        entries[entry].Next = this.pBuckets[bucketIndex];
        this.pBuckets[bucketIndex] = entry;

        // Increments total number of items.
        ++this.pCount;

        return true;
    }

    /// <summary>
    /// Gets a value that indicates wether the specified key exists or not in this table.
    /// Complexity is averaged O(1).
    /// </summary>
    public bool Contains(int key)
    {
        // This translate in a simple linear search in the linked list for the right bucket.
        // The operation, if array size is well balanced and hash function is good, will be almost O(1).

        int bucket = this.pBuckets[(key & 0x7FFFFFFF) % this.pBuckets.Length];
        Entry[] entries = this.pEntries;
        for (int current = entries[bucket]; current != -1; current = entries[current].Next)
        {
            if (entries[current].Key == key)
            {
                return true;
            }
        }
        return false;
    }

    /// <summary>
    /// Removes the specified item from the dictionary.
    /// Returns true if item was found and removed, false if item doesn't exists.
    /// Complexity is averaged O(1).
    /// </summary>
    public bool Remove(int key)
    {
        // Removal translate in a simple contains and removal from a singly linked list.
        // Quite simple.

        int bucketIndex = (key & 0x7FFFFFFF) % this.pBuckets.Length;
        int bucket = this.pBuckets[bucketIndex];
        Entry[] entries = this.pEntries;
        int next;
        int prev = -1;
        int current = entries[bucket];

        while (current != -1)
        {
            next = entries[current].Next;

            if (entries[current].Key == key)
            {
                // Found! Remove from linked list.
                if (prev != -1)
                    entries[prev].Next = next;
                else
                    this.pBuckets[bucketIndex] = next;

                // We now add the removed node to the free list,
                // so we can use it later if we add new elements.
                entries[current].Next = this.pFirstFreeEntry;
                this.pFirstFreeEntry = current;

                // Decrements total number of items.
                --this.pCount;

                return true;
            }

            prev = current;
            current = next;
        }
        return false;
    }

}

If you wander if this implementation is good or not, is a very similar implementation of what the .NET framework do for Dictionary class :)

To make it an hashset, just remove the T and you have an hashset of integers. If you need to get hashcodes for generic objects, just use x.GetHashCode or provide your hash code function.

To write iterators you need to modify several things, but don't want to add too much other things in this post :)

like image 44
Salvatore Previti Avatar answered Nov 07 '22 11:11

Salvatore Previti