I would like to know how a set is implemented in C++. If I were to implement my own set container without using the STL provided container, what would be the best way to go about this task?
I understand STL sets are based on the abstract data structure of a binary search tree. So what is the underlying data structure? An array?
Also, how does insert()
work for a set? How does the set check whether an element already exists in it?
I read on wikipedia that another way to implement a set is with a hash table. How would this work?
I understand STL sets are based on the abstract data structure of a binary search tree.
A set is a data structure that stores unique elements of the same type in a sorted order. Each value is a key, which means that we access each value using the value itself. With arrays, on the other hand, we access each value by its position in the container (the index). Accordingly, each value in a set must be unique.
A vector, in computing, is generally a one-dimensional array, typically storing numbers. Vectors typically have fixed sizes, unlike lists and queues. The vector data structure can be used to represent the mathematical vector used in linear algebra. See related pages for mathematical vector operations.
C++'s Standard Template Library (STL) C++ comes with a large library of useful data structures, including resizable arrays ( std::vector ), linked lists ( std::list ), ordered search trees ( std::map ), hash tables ( std::unordered_map ), and sets ( std::set and std::unordered_set ).
I understand STL sets are based on the abstract data structure of a binary search tree. So what is the underlying data structure? An array? As others have pointed out, it varies. A set is commonly implemented as a tree (red-black tree, balanced tree, etc.) though but there may be other implementations that exist.
The C++ Standard Template Library (STL) The Standard Template Library (STL) is a set of C++ template classes to provide common programming data structures and functions such as lists, stacks, arrays, etc. It is a library of container classes, algorithms, and iterators.
The Standard Template Library (STL) is a set of C++ template classes to provide common programming data structures and functions such as lists, stacks, arrays, etc. It is a library of container classes, algorithms, and iterators.
So what is the underlying data structure? An array? As others have pointed out, it varies. A set is commonly implemented as a tree (red-black tree, balanced tree, etc.) though but there may be other implementations that exist. Also, how does insert () work for a set? It depends on the underlying implementation of your set.
Step debug into g++
6.4 stdlibc++ source
Did you know that on Ubuntu's 16.04 default g++-6
package or a GCC 6.4 build from source you can step into the C++ library without any further setup?
By doing that we easily conclude that a Red-black tree used in this implementation.
This makes sense, since std::set
can be traversed in order, which would not be efficient in if a hash map were used.
main.cpp
#include <cassert>
#include <set>
int main() {
std::set<int> s;
s.insert(1);
s.insert(2);
assert(s.find(1) != s.end());
assert(s.find(2) != s.end());
assert(s.find(3) == s3.end());
}
Compile and debug:
g++ -g -std=c++11 -O0 -o main.out main.cpp
gdb -ex 'start' -q --args main.out
Now, if you step into s.insert(1)
you immediately reach /usr/include/c++/6/bits/stl_set.h
:
487 #if __cplusplus >= 201103L
488 std::pair<iterator, bool>
489 insert(value_type&& __x)
490 {
491 std::pair<typename _Rep_type::iterator, bool> __p =
492 _M_t._M_insert_unique(std::move(__x));
493 return std::pair<iterator, bool>(__p.first, __p.second);
494 }
495 #endif
which clearly just forwards to _M_t._M_insert_unique
.
So we open the source file in vim and find the definition of _M_t
:
typedef _Rb_tree<key_type, value_type, _Identity<value_type>,
key_compare, _Key_alloc_type> _Rep_type;
_Rep_type _M_t; // Red-black tree representing set.
So _M_t
is of type _Rep_type
and _Rep_type
is a _Rb_tree
.
OK, now that is enough evidence for me. If you don't believe that _Rb_tree
is a Black-red tree, step a bit further and read the algorithm.
unordered_set
uses hash table
Same procedure, but replace set
with unordered_set
on the code.
This makes sense, since std::unordered_set
cannot be traversed in order, so the standard library chose hash map instead of Red-black tree, since hash map has a better amortized insert time complexity.
Stepping into insert
leads to /usr/include/c++/6/bits/unordered_set.h
:
415 std::pair<iterator, bool>
416 insert(value_type&& __x)
417 { return _M_h.insert(std::move(__x)); }
So we open the source file in vim
and search for _M_h
:
typedef __uset_hashtable<_Value, _Hash, _Pred, _Alloc> _Hashtable;
_Hashtable _M_h;
So hash table it is.
std::map
and std::unordered_map
Analogous for std::set
vs std:unordered_set
: What data structure is inside std::map in C++?
Performance characteristics
You could also infer the data structure used by timing them:
Graph generation procedure and Heap vs BST analysis and at: Heap vs Binary Search Tree (BST)
We clearly see for:
std::set
, a logarithmic insertion timestd::unordered_set
, a more complex hashmap pattern:
on the zoomed plot, we see that the times are basically constant and going towards 250ns, therefore much faster than the std::map
, except for very small map sizes
Several strips are clearly visible, and their inclination becomes smaller whenever the array doubles.
I believe this is due to average linearly increasing linked list walks withing each bin. Then when the array doubles, we have more bins, so shorter walks.
As KTC said, how std::set
is implemented can vary -- the C++ standard simply specifies an abstract data type. In other words, the standard does not specify how a container should be implemented, just what operations it is required to support. However, most implementations of the STL do, as far as I am aware, use red-black trees or other balanced binary search trees of some kind (GNU libstdc++, for instance, uses red-black trees).
While you could theoretically implement a set as a hash table and get faster asymptotic performance (amortized O(key length) versus O(log n) for lookup and insert), that would require having the user supply a hash function for whatever type they wanted to store (see Wikipedia's entry on hash tables for a good explanation of how they work). As for an implementation of a binary search tree, you wouldn't want to use an array -- as Raul mentioned, you would want some kind of Node
data structure.
You could implement a binary search tree by first defining a Node
struct:
struct Node
{
void *nodeData;
Node *leftChild;
Node *rightChild;
}
Then, you could define a root of the tree with another Node *rootNode;
The Wikipedia entry on Binary Search Tree has a pretty good example of how to implement an insert method, so I would also recommend checking that out.
In terms of duplicates, they are generally not allowed in sets, so you could either just discard that input, throw an exception, etc, depending on your specification.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With