
Why is Python's dict implemented as a hash table whereas std::map is tree-based?

Why does one language use a tree and another a hash table for seemingly similar data structures?

C++'s map vs Python's dict

A related question is about the performance of hash tables.
Please comment on my understanding of hash tables below.

A tree guarantees O(log n) operations.
A hash table, on the other hand, offers no guarantee unless the inputs are known in advance, because of possible collisions.
I tend to think a hash table's performance would approach O(n) as the problem size gets bigger,
because I haven't heard of a hash function that dynamically adjusts its table size as the problem size grows.

Hence, a hash table is only useful for a certain range of problem sizes, and that's why most DBs use trees rather than hash tables.

eugene asked Nov 25 '11 06:11



2 Answers

The new C++ standard (C++11) has the std::unordered_map type, which is a hash table. IIRC the committee wanted it in the previous standard as well, but there was not enough time during the discussions, so it was left out. However, most popular compilers have provided it in one form or another for years (for example as std::tr1::unordered_map or a hash_map extension).

In other words, don't worry about it too much. Use the proper data structure for the task at hand.


As for your understanding of hash tables, it's inaccurate:

I haven't heard of a hash function that dynamically adjusts its table size as the problem size grows

All serious hash table implementations adjust themselves dynamically to growing input, by allocating a larger array and re-hashing all the keys. Although this operation is expensive, if it is designed properly (so that it happens rarely enough), insertion is still amortized O(1).
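
The resizing idea can be sketched as follows. This is a toy illustration, not how any particular library does it: the class name `ResizingHashTable`, the starting capacity of 8, the 0.75 load limit, and the doubling policy are all assumptions chosen for the example.

```python
class ResizingHashTable:
    """Toy separate-chaining hash table that doubles its bucket array when too full."""

    def __init__(self, capacity=8, max_load=0.75):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0
        self._max_load = max_load   # assumed growth threshold
        self.resizes = 0            # how many times the table grew
        self.rehashed = 0           # total entries moved by all resizes

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))
        self._size += 1
        if self._size > self._max_load * len(self._buckets):
            self._grow()

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def _grow(self):
        # Allocate a larger array and re-hash every existing entry into it.
        old_entries = [entry for bucket in self._buckets for entry in bucket]
        self._buckets = [[] for _ in range(2 * len(self._buckets))]
        for k, v in old_entries:
            self._bucket(k).append((k, v))
        self.resizes += 1
        self.rehashed += len(old_entries)


table = ResizingHashTable()
n = 10_000
for i in range(n):
    table.put(i, i * i)
```

Because the capacity doubles each time, the number of resizes grows only logarithmically with `n`, and the total number of entries moved (`table.rehashed`) stays proportional to `n`. Spread over all insertions, that is a constant extra cost per insert, which is what "amortized O(1)" means here.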

Eli Bendersky answered Oct 05 '22 23:10


Your understanding of hash tables (and of who uses them) is flawed.

The problem is, hash table is a rather vague term. Under the hood there are many implementations... but first let's talk about the use of BST (Binary Search Trees).


Why does C++ use a Binary Search Tree?

C++ is designed by committee. There are many possible implementations of hash tables, leading to widely different characteristics, while the most popular implementations of balanced BSTs (Red-Black Tree and AVL Tree) have nearly identical characteristics. The committee therefore did not reject hash tables outright; they just could not settle on which characteristics to choose and which details to expose to the user.

See James Kanze's comment: the proposal arrived too late, and James asks an interesting question as to why Stepanov did not propose it first. I still suspect the number of choices to be the culprit.

Why do databases use Search Trees?

First of all, let's settle on a database software. I'll pick Oracle because it's both widely documented and typical of SQL databases. Oracle offers two types of indexes: Bitmap and Search Trees.

Note: they do not use BINARY Search Trees, but instead use B+Trees, which are much more I/O- and cache-friendly.

There is a fundamental difference between a Hash Table and a Search Tree: the latter is sorted. Many database operations rely on that ordering:

  • get the nth element
  • get the top n elements
  • get the elements in [a,b]

In all those cases, a Hash Table is useless, because it keeps no ordering among its keys.
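
A small sketch of the difference, under stated assumptions: Python has no built-in tree map, so a sorted key list searched with `bisect` stands in for the ordered index a search tree would maintain. The `prices` data and the helper names are hypothetical.

```python
import bisect

prices = {"kiwi": 3, "apple": 5, "fig": 9, "pear": 4, "date": 7}

# A sorted key list stands in for the ordered index a tree would maintain.
sorted_keys = sorted(prices)


def keys_in_range(keys, lo, hi):
    """Return the keys k with lo <= k <= hi using two binary searches."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return keys[start:end]


def keys_in_range_unordered(mapping, lo, hi):
    """Same query against a hash table: every key has to be inspected."""
    return sorted(k for k in mapping if lo <= k <= hi)


in_range = keys_in_range(sorted_keys, "b", "g")   # elements in [a, b]
nth = sorted_keys[2]                              # the 3rd smallest key
top_two = sorted_keys[-2:]                        # the two largest keys
```

With the ordered structure, each query touches only the keys it returns plus a logarithmic search. With the plain dict, `keys_in_range_unordered` has to scan all keys no matter how small the result is.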

Furthermore, databases generally need to juggle huge datasets, which means that they need to organize their data so as to minimize I/O (disk reads and writes). Here, the sorted nature of a Search Tree means that (in the index) elements that are likely to be accessed together (because their keys are close to each other) will also be grouped together instead of being scattered to the four corners of the disk.

Finally, internally Oracle may use Hash Tables in its execution plan. When you perform an operation that requires the intersection of two sets of rows, the optimization engine may decide that storing the (temporary) sets in Hash Tables is the fastest way to go.


Now, regarding performance.

Indeed, the performance of Search Trees is generally well known and easy to understand: O(log N) is nice and tidy.

On the other hand, as I said, there are many different Hash Table implementations possible, as well as many strategies to handle both growth and shrinkage... definitely more complicated.

As a simple example of structure, a Hash Table may use:

  • Open Addressing: the hash table is an array of elements, and the hash indicates the slot of the array in which to put the element. If the slot is full, there is a strategy to locate another slot. The same strategy is used for searching.
  • Buckets: the hash table is an array of pointers to buckets, and the hash indicates the bucket in which to put the element. It is assumed that a bucket can grow without bound.

Those two strategies have extremely different characteristics, and the characteristics of the latter also depend on the bucket implementation (the easy implementation is a simple linked list).
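
To make the two layouts concrete, here is a minimal sketch of each. Both classes are hypothetical, fixed-size, and support no deletion or resizing; linear probing is assumed as the open-addressing strategy, and a Python list plays the role of the linked-list bucket.

```python
class OpenAddressingTable:
    """Fixed-size table: entries live directly in the array (linear probing)."""

    def __init__(self, capacity=8):
        self._slots = [None] * capacity

    def _probe(self, key):
        """Yield slot indices in probe order, starting at the hashed slot."""
        start = hash(key) % len(self._slots)
        for step in range(len(self._slots)):
            yield (start + step) % len(self._slots)

    def put(self, key, value):
        for i in self._probe(key):
            if self._slots[i] is None or self._slots[i][0] == key:
                self._slots[i] = (key, value)
                return
        raise RuntimeError("table is full")  # a real table would resize

    def get(self, key):
        for i in self._probe(key):
            if self._slots[i] is None:
                break                        # an empty slot ends the search
            if self._slots[i][0] == key:
                return self._slots[i][1]
        raise KeyError(key)


class ChainedTable:
    """Fixed-size array of buckets; each bucket is a list that can keep growing."""

    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]

    def put(self, key, value):
        bucket = self._buckets[hash(key) % len(self._buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._buckets[hash(key) % len(self._buckets)]:
            if k == key:
                return v
        raise KeyError(key)


# Keys 1, 9 and 17 all map to index 1 when the capacity is 8.
open_table, chained = OpenAddressingTable(), ChainedTable()
for key in (1, 9, 17):
    open_table.put(key, str(key))
    chained.put(key, str(key))
```

After the three colliding inserts, the open-addressing table has spilled the keys into neighbouring slots of the array, while the chained table holds all three in a single bucket. The array itself fills up in the first design and never does in the second, which is one reason their behaviour under load differs so much.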

But even if you pick an implementation, its performance is based on the hash function distribution, which varies depending on the input sequence itself!
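
The effect of the hash distribution can be observed with Python's own dict. The sketch below uses a hypothetical `Key` class whose hash can be made deliberately bad (a constant), and counts how many equality checks the dict performs while inserting `n` distinct keys. The class, the counter, and the value of `n` are assumptions for illustration.

```python
class Key:
    """Hypothetical key type whose hash quality can be switched off."""

    eq_calls = 0  # counts equality checks triggered by dict operations

    def __init__(self, value, good_hash):
        self.value = value
        self.good_hash = good_hash

    def __hash__(self):
        # A constant hash makes every key collide with every other key.
        return hash(self.value) if self.good_hash else 42

    def __eq__(self, other):
        Key.eq_calls += 1
        return self.value == other.value


def count_comparisons(n, good_hash):
    """Insert n distinct keys into a dict and return how many == checks ran."""
    Key.eq_calls = 0
    table = {}
    for i in range(n):
        table[Key(i, good_hash)] = i
    return Key.eq_calls


n = 500
good = count_comparisons(n, good_hash=True)
bad = count_comparisons(n, good_hash=False)
```

With well-spread hashes the dict needs almost no key comparisons, whereas with the constant hash every insertion has to be compared against all keys already stored, so the total work grows quadratically in `n`. The table implementation is the same in both runs; only the hash distribution changed.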


My personal advice? To pick between unordered_map and map in C++, I simply ask myself whether I need sorted elements or not. If I need them to be sorted I use a map, otherwise I use an unordered_map. Most of the time the performance is just as good either way, so it's just a matter of semantics.

Matthieu M. answered Oct 06 '22 00:10