Why is python's dict implemented as hash table whereas std::map is tree-based?

Tags: c++, python, hashtable, map

Why does one language use a tree and another a hash table for seemingly similar data structures?

C++'s map vs Python's dict

A related question is about the performance of hash tables.
Please comment on my understanding of hash tables below.

A tree guarantees O(log n) operations.
A hash table, on the other hand, has no such guarantee unless the inputs are known in advance, because of possible collisions.
I tend to think a hash table's performance approaches O(n) as the problem size gets bigger,
because I haven't heard of a hash function that dynamically adjusts its table size as the problem size grows.

Hence, a hash table is only useful for a certain range of problem sizes, and that's why most databases use trees rather than hash tables.

asked Nov 25 '11 06:11

eugene

2 Answers

The new C++ standard (C++11) has the std::unordered_map type, which is a hash table. IIRC they wanted to get it into the previous standard as well, but there was not enough time during the discussions, so it was left out. However, most popular compilers have provided it in one form or another for years.

In other words, don't worry about it too much. Use the proper data structure for the task at hand.

As for your understanding of hash tables, it's inaccurate:

"I haven't heard of a hash function that dynamically adjusts its table size as the problem size grows"

All serious hash table implementations dynamically adjust themselves to growing input by allocating a larger array and rehashing all the keys. Although this operation is expensive, if it is designed properly (so that it happens rarely enough), insertion is still amortized O(1).
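To make the amortized argument concrete, here is a minimal sketch of a table that doubles its bucket array whenever the load factor would exceed a threshold. It is not how CPython's dict or any particular std::unordered_map is written; the class name `GrowingHashTable`, the starting capacity of 8, and the 0.75 load factor are all assumptions chosen for illustration. The `rehashed` counter records how many entries were moved across all resizes.

```python
class GrowingHashTable:
    """Toy separate-chaining hash table that doubles its bucket array when full.

    Illustrative only: names, capacity and load factor are assumptions.
    """

    def __init__(self, capacity=8, max_load=0.75):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0
        self._max_load = max_load
        self.rehashed = 0  # total entries moved during all resizes

    def _bucket_for(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def _resize(self):
        # Allocate an array twice as large and re-insert every entry.
        old = self._buckets
        self._buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for key, value in bucket:
                self._bucket_for(key).append((key, value))
                self.rehashed += 1

    def put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:  # key already present: overwrite in place
                bucket[i] = (key, value)
                return
        if self._size + 1 > self._max_load * len(self._buckets):
            self._resize()
            bucket = self._bucket_for(key)  # bucket index changed after resize
        bucket.append((key, value))
        self._size += 1

    def get(self, key):
        for k, v in self._bucket_for(key):
            if k == key:
                return v
        raise KeyError(key)

    def capacity(self):
        return len(self._buckets)

    def __len__(self):
        return self._size


table = GrowingHashTable()
for i in range(10_000):
    table.put(i, i * i)

# Each resize moves fewer entries than the one after it, so the total
# number of moves stays below twice the number of insertions.
print(len(table), table.capacity(), table.rehashed)
```

Because the capacity doubles each time, the work done by all resizes together forms a geometric series bounded by a constant times the number of insertions, which is where the amortized O(1) comes from.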

answered Oct 05 '22 23:10

Eli Bendersky

Your understanding of hash tables (and of who uses them) is flawed.

The problem is that "hash table" is a rather vague term. Under the hood there are many implementations... but first let's talk about the use of BSTs (Binary Search Trees).

Why does C++ use a Binary Search Tree?

C++ is designed by committee. There are many possible implementations of hash tables, leading to widely different characteristics, while the most popular implementations of BSTs (Red-Black Tree and AVL Tree) have nearly identical characteristics. Therefore, the committee did not reject hash tables outright; they just could not settle on which characteristics to choose and which details to expose to the user.

See James Kanze's comment: the proposal arrived too late, and James asks an interesting question as to why Stepanov did not propose it first. I still suspect the number of choices to be the culprit.

Why do databases use Search Trees?

First of all, let's settle on a database software. I'll pick Oracle because it's both widely documented and quite typical of SQL databases. Oracle offers two types of indexes: Bitmap and Search Trees.

Note: they do not use BINARY Search Trees, but B+Trees, which are much more IO- and cache-friendly.

There is a fundamental difference between a Hash Table and a Search Tree: the latter is sorted. Many database operations rely on ordering:

get the nth element
get the top n elements
get the elements in [a,b]

In all those cases, a Hash Table is useless.
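A small sketch of why ordering matters for these three queries. A sorted Python list with the standard `bisect` module stands in for the ordered index here; it is not a B+Tree, and the helper names (`nth_smallest`, `top_n`, `range_query`, `range_query_unordered`) are made up for the example. The unordered version shows what is left when only a hash-based container is available: scan every key.

```python
import bisect

# Stand-in for an ordered index: keys kept in sorted order.
sorted_keys = sorted([42, 7, 19, 3, 88, 61, 25])

def nth_smallest(keys, n):
    """n-th element (0-based) is a direct index into the sorted sequence."""
    return keys[n]

def top_n(keys, n):
    """Largest n keys, biggest first, taken from the tail."""
    return keys[-n:][::-1]

def range_query(keys, low, high):
    """All keys in [low, high], located with two binary searches."""
    start = bisect.bisect_left(keys, low)
    stop = bisect.bisect_right(keys, high)
    return keys[start:stop]

def range_query_unordered(table, low, high):
    """Same query against a hash-based container: every key must be visited."""
    return sorted(k for k in table if low <= k <= high)

hashed = {k: None for k in sorted_keys}  # a dict has no key ordering to exploit

print(nth_smallest(sorted_keys, 2))
print(top_n(sorted_keys, 3))
print(range_query(sorted_keys, 10, 50))
print(range_query_unordered(hashed, 10, 50))
```

Both range queries return the same keys, but the ordered version finds the boundaries in O(log n) and then reads a contiguous slice, whereas the hashed version is O(n) regardless of how few keys fall in the range.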

Furthermore, databases need to juggle huge datasets (in general), which means that they need to organize their data so as to minimize IO (disk reads/writes). Here, the sorted nature of a Search Tree means that (in the index) elements that are likely to be accessed together (because they have much in common) will also be grouped together instead of being scattered to the four corners of the disk.

Finally, internally Oracle may use Hash Tables in its execution plan. When you perform an operation that requires the intersection of two sets of rows, the optimization engine may decide that storing the (temporary) sets in Hash Tables is the fastest way to go.

Now, regarding performance.

Indeed, the performance of Search Trees is generally well-known and easy to understand: O(log N) is nice and tidy.

On the other hand, as I said, there are many different Hash Table implementations possible, as well as different strategies to handle both growing and shrinking... definitely more complicated.

As a simple example, a Hash Table may be structured in one of two ways:

Open Addressing: the hash table is an array of elements, and the hash indicates the slot of the array in which to put the element. If the slot is full, there is a strategy to locate another slot. The same strategy is used for searching.
Buckets: the hash table is an array of pointers to buckets, and the hash indicates the bucket in which to put the element. It is assumed that a bucket can grow indefinitely.

Those two strategies have extremely different characteristics, and the characteristics of the latter also depend on the bucket implementation (the easy implementation is a simple linked list).
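The two layouts can be sketched side by side as follows. This is a deliberately stripped-down illustration: both tables have a fixed capacity, neither supports deletion (which would need tombstones in the open-addressing case), linear probing is just one of several possible probing strategies, and Python lists stand in for linked-list buckets. The class names are invented for the example.

```python
class LinearProbingTable:
    """Open addressing: entries live in the array itself; collisions probe forward."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity

    def _probe(self, key):
        # Visit slots starting at the hashed position, wrapping around once.
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full (a real table would resize first)")

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:  # empty slot ends the search
                break
            if self.slots[i][0] == key:
                return self.slots[i][1]
        raise KeyError(key)


class ChainedTable:
    """Buckets: each array slot holds a growable collection of entries."""

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        raise KeyError(key)


# 1, 9 and 17 all hash to slot 1 in a table of capacity 8.
probing, chained = LinearProbingTable(), ChainedTable()
for key, value in [(1, "a"), (9, "b"), (17, "c")]:
    probing.put(key, value)
    chained.put(key, value)
```

With these three colliding keys, the open-addressing table spills into neighbouring slots 2 and 3, while the chained table keeps all three entries in bucket 1. That difference is what drives the diverging cache behaviour, memory overhead, and load-factor limits of the two designs.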

But even if you pick an implementation, its performance depends on how well the hash function distributes the keys, which in turn depends on the input sequence itself!
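The effect of a poor distribution can be observed with Python's own dict. The sketch below uses a hypothetical key class, `CountingKey`, that counts equality checks and can be told to return the same hash for every instance. The exact counts depend on the interpreter, so treat the printed numbers as indicative only.

```python
class CountingKey:
    """Integer wrapper that counts equality checks.

    With bad=True every instance hashes to 0, so all keys collide.
    """

    comparisons = 0

    def __init__(self, value, bad=False):
        self.value = value
        self.bad = bad

    def __hash__(self):
        return 0 if self.bad else hash(self.value)

    def __eq__(self, other):
        CountingKey.comparisons += 1
        return isinstance(other, CountingKey) and self.value == other.value


def comparisons_to_build(n, bad):
    """Insert n distinct keys into a dict and report how many __eq__ calls it took."""
    CountingKey.comparisons = 0
    table = {}
    for i in range(n):
        table[CountingKey(i, bad)] = i
    return CountingKey.comparisons


n = 200
good = comparisons_to_build(n, bad=False)
worst = comparisons_to_build(n, bad=True)
print(good, worst)
```

When every key shares one hash value, each insertion has to be compared against all the keys already stored, so building the table costs on the order of n² comparisons instead of roughly n. The same table implementation thus ranges from O(1) to O(n) per operation depending purely on the hash function and the keys it is fed.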

My personal advice? To pick between unordered_map and map in C++, I simply ask myself whether I need sorted elements or not. If I need them sorted I use a map; otherwise I use an unordered_map. Most of the time the performance is just as good either way, so it comes down to semantics.

answered Oct 06 '22 00:10

Matthieu M.
