
What is the meaning of hash if we still need to check every item?

Tags: python, types, python-3.x, hash, set

We know that tuple objects are immutable and thus hashable. We also know that lists are mutable and non-hashable.

This can be easily illustrated:

>>> set([1, 2, 3, (4, 2), (2, 4)])
{(2, 4), (4, 2), 1, 2, 3}

>>> set([1, 2, 3, [4, 2], [2, 4]])
TypeError: unhashable type: 'list'

Now, what is the meaning of hash in this context, if, in order to check for uniqueness (e.g. when building a set), we still have to check each individual item in whatever iterables are there in the set anyway?

We know two objects can have the same hash value and still be different, so the hash alone is not enough to compare objects. So what is the point of hash? Why not just check each individual item in the iterables directly?

My intuition is that it could be for one of these reasons:

1. hash is just a (pretty quick) preliminary comparison. If the hashes are different, we know the objects are different.
2. The lack of a hash signals that an object is mutable. This should be enough to raise an exception when comparing to other objects: at that specific time, the objects could be equal, but maybe later, they are not.

Am I going in the right direction? Or am I missing an important piece of this?

Thank you


asked Nov 05 '18 18:11

Oliver

2 Answers

Now, what is the meaning of hash in this context, if, in order to check for uniqueness (e.g. when building a set), we still have to check each individual item in whatever iterables are there in the set anyway?

Yes, but the hash serves two purposes. It gives a conservative estimate of whether two objects can be equal (different hashes mean the objects are certainly different), and it is used to assign an item to a "bucket". If the hash function is designed carefully, it is likely (though not certain) that most items, if not all, end up in different buckets. As a result, membership checks, insertions, removals and so on run in constant time O(1) on average, instead of the O(n) that is typical for lists.

So your first answer is partly correct, but the buckets boost performance as well, and they are actually more important than the conservative check.
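
To make the "conservative check" concrete, here is a small sketch that counts how often __eq__ is called during a membership test. The class CountingKey and the helper count_eq_calls are made up for this illustration and are not part of any library. The sketch assumes CPython, where small non-negative integers hash to themselves, so all keys below have distinct hashes.

```python
class CountingKey:
    """Hypothetical helper: wraps a value and counts equality checks."""

    eq_calls = 0  # shared counter, reset before each experiment

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.value == other.value


def count_eq_calls(container, key):
    """Return (result, number of __eq__ calls) for `key in container`."""
    CountingKey.eq_calls = 0
    found = key in container
    return found, CountingKey.eq_calls


keys_list = [CountingKey(i) for i in range(1000)]
keys_set = set(keys_list)

# A list has to compare the key against every stored element.
list_missing = count_eq_calls(keys_list, CountingKey(5000))
# A set only calls __eq__ for stored elements whose hash matches.
set_missing = count_eq_calls(keys_set, CountingKey(5000))
set_present = count_eq_calls(keys_set, CountingKey(500))

print(list_missing, set_missing, set_present)
```

On CPython the list lookup should report 1000 equality checks, while the set reports none for the missing key and a single one for the key that is present.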

Background

Note: I will use a simplified model here that makes the principle clear; in reality the implementation of a dictionary is more complicated. For example, the hashes here are just made-up numbers that illustrate the principle.

Hash sets and dictionaries are implemented as an array of "buckets". The hash of an element determines in which bucket we store it. If the number of elements grows, the number of buckets is increased, and the elements that are already in the dictionary are typically "reassigned" to the new buckets.

For example, an empty dictionary might look, internally, like:

+---+
|   |
| o----> NULL
|   |
+---+
|   |
| o----> NULL
|   |
+---+

So there are two buckets. Suppose we add an element 'a' whose hash is 123. Let us consider a simple algorithm to allocate an element to a bucket: since there are two buckets, we assign elements with an even hash to the first bucket and elements with an odd hash to the second bucket. Since the hash of 'a' is odd, we assign 'a' to the second bucket:

+---+
|   |
| o----> NULL
|   |
+---+
|   |   +---+---+
| o---->| o | o----> NULL
|   |   +-|-+---+
+---+    'a'

If we now check whether 'b' is a member of the dictionary, we first calculate hash('b'), which is 456. If we had added 'b' to the dictionary, it would be in the first bucket. Since the first bucket is empty, we never have to look at the elements in the second bucket to know for sure that 'b' is not a member.

If we then want to add 'c', we first calculate the hash of 'c', which is 789, so it goes into the second bucket as well, for example:

+---+
|   |
| o----> NULL
|   |
+---+
|   |   +---+---+   +---+---+
| o---->| o | o---->| o | o----> NULL
|   |   +-|-+---+   +-|-+---+
+---+    'c'         'a'

If we now check again whether 'b' is a member, we look in the first bucket, and again we never have to iterate over 'c' and 'a' to know for sure that 'b' is not a member of the dictionary.
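
The two-bucket situation above can be sketched in a few lines of Python. This is only a toy model under the same simplifying assumptions as the diagrams: TOY_HASH holds the made-up hash values from the text (not real hash() results), and the helper names bucket_for, add and contains are invented for this example.

```python
# Made-up hash values taken from the walkthrough (not real hash() results).
TOY_HASH = {"a": 123, "b": 456, "c": 789}

buckets = [[], []]  # bucket 0: even hashes, bucket 1: odd hashes


def bucket_for(key):
    """Pick the bucket whose index is the toy hash modulo the bucket count."""
    return buckets[TOY_HASH[key] % len(buckets)]


def add(key):
    """Store `key` in its bucket unless it is already there."""
    bucket = bucket_for(key)
    if key not in bucket:
        bucket.insert(0, key)  # new nodes go to the front, as in the diagrams


def contains(key):
    """Return (found, number of stored keys that were inspected)."""
    inspected = 0
    for stored in bucket_for(key):
        inspected += 1
        if stored == key:
            return True, inspected
    return False, inspected


add("a")
add("c")
print(buckets)        # the second bucket holds 'c' and 'a'
print(contains("b"))  # 'b' maps to the empty first bucket
```

Looking up 'b' inspects zero stored keys, because only the (empty) first bucket is consulted. The keys 'c' and 'a' in the other bucket are never touched.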

Of course, one might argue that if we keep adding more characters like 'e' and 'g' (here we assume these have an odd hash), that bucket will get quite full, and if we later check whether 'i' is a member, we will still need to iterate over its elements. But as the number of elements grows, the number of buckets typically increases as well, and the elements in the dictionary are assigned to new buckets.

For example, if we now want to add 'b' to the dictionary, the dictionary might note that the number of elements after insertion, 3, is larger than the number of buckets, 2, so we create a new array of buckets:

+---+
|   |
| o----> NULL
|   |
+---+
|   |
| o----> NULL
|   |
+---+
|   |
| o----> NULL
|   |
+---+
|   |
| o----> NULL
|   |
+---+

and we reassign the members 'a' and 'c'. Now all elements with a hash h with h % 4 == 0 will be assigned to the first bucket, h % 4 == 1 to the second bucket, h % 4 == 2 to the third bucket, and h % 4 == 3 to the last bucket. So that means that 'a' with hash 123 will be stored in the last bucket, and 'c' with hash 789 will be stored in the second bucket, so:

+---+
|   |
| o----> NULL
|   |
+---+
|   |   +---+---+
| o---->| o | o----> NULL
|   |   +-|-+---+
+---+    'c'
|   |
| o----> NULL
|   |
+---+
|   |   +---+---+
| o---->| o | o----> NULL
|   |   +-|-+---+
+---+    'a'

we then add 'b' with hash 456 to the first bucket, so:

+---+
|   |   +---+---+
| o---->| o | o----> NULL
|   |   +-|-+---+
+---+    'b'
|   |   +---+---+
| o---->| o | o----> NULL
|   |   +-|-+---+
+---+    'c'
|   |
| o----> NULL
|   |
+---+
|   |   +---+---+
| o---->| o | o----> NULL
|   |   +-|-+---+
+---+    'a'
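
The growth step can be sketched as follows. The policy used here (double the bucket count whenever the number of elements would exceed the number of buckets) is an assumption chosen to match the walkthrough, not a description of CPython's actual resizing rule. TOY_HASH again holds the made-up hashes, and rehash and add are hypothetical helper names.

```python
TOY_HASH = {"a": 123, "b": 456, "c": 789}  # made-up hashes from the walkthrough


def rehash(old_buckets, new_size):
    """Redistribute every stored key over `new_size` fresh buckets."""
    new_buckets = [[] for _ in range(new_size)]
    for bucket in old_buckets:
        for key in bucket:
            new_buckets[TOY_HASH[key] % new_size].append(key)
    return new_buckets


def add(buckets, key):
    """Add `key` and return the (possibly enlarged) bucket array."""
    if key in buckets[TOY_HASH[key] % len(buckets)]:
        return buckets  # already present, nothing to do
    stored = sum(len(bucket) for bucket in buckets)
    if stored + 1 > len(buckets):
        # Assumed policy: double the table when it would hold more
        # elements than it has buckets.
        buckets = rehash(buckets, 2 * len(buckets))
    buckets[TOY_HASH[key] % len(buckets)].append(key)
    return buckets


table = [[], []]
for key in ("a", "c", "b"):
    table = add(table, key)

print(table)
```

After the three insertions the table has four buckets, with 'b' in the first, 'c' in the second, nothing in the third and 'a' in the last, matching the final diagram.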

So if we want to check the membership of 'a', we calculate the hash (123, and 123 % 4 == 3), know that if 'a' is in the dictionary it must be in the last bucket, and find it there. If we look for 'b' or 'c', the same happens (but with a different bucket). If we look for 'd' (here with hash 14, so 14 % 4 == 2), we search in the third bucket, which is empty, so we never have to check equality with a single element to know that 'd' is not part of the dictionary.

If we want to check whether 'e' is a member, we calculate the hash of 'e' (here 345, and 345 % 4 == 1) and search in the second bucket. Since that bucket is not empty, we start iterating over it.

For every element in the bucket (here there is only one), the algorithm first checks whether the key we search for and the key in the node refer to the same object. Identity implies a match, but two different objects can still be equal, so when the keys are not the same object, as is the case here, we cannot conclude anything yet.

Next we compare the hash of the key we search for with the hash of the key in the node. Most dictionary implementations (CPython's dictionaries and sets as well, if I recall correctly) store the hash in the list node too. So here it checks whether 345 is equal to 789. Since it is not, we know that 'c' and 'e' are not the same. If comparing the two objects is expensive, this saves some cycles.

If the hashes are equal, that does not mean that the elements are equal, so in that case we check whether the two objects are equal. If they are, we know that the element is in the dictionary; otherwise we continue with the next node in the bucket, and if the bucket is exhausted, we know it is not.
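
The three checks performed inside a bucket (identity, stored hash, equality) can be written down as a short function. This is a sketch of the simplified model only. The name find_in_bucket and the representation of a bucket as a list of (stored_hash, stored_key) pairs are assumptions made for this example, and real hash() values are used instead of the made-up numbers.

```python
def find_in_bucket(bucket, key):
    """Search one bucket; `bucket` is a list of (stored_hash, stored_key) pairs."""
    key_hash = hash(key)
    for stored_hash, stored_key in bucket:
        if stored_key is key:        # 1. same object: found immediately
            return True
        if stored_hash != key_hash:  # 2. different hashes: cannot be equal
            continue
        if stored_key == key:        # 3. same hash: do the real comparison
            return True
    return False  # bucket exhausted without a match


first = (1, 2)
bucket = [(hash(first), first), (hash("spam"), "spam")]

print(find_in_bucket(bucket, first))          # identical object
print(find_in_bucket(bucket, tuple([1, 2])))  # equal, but built separately
print(find_in_bucket(bucket, "eggs"))         # not stored
```

Step 2 is the cheap shortcut: the possibly expensive == in step 3 only runs for stored keys whose hash matches the hash of the key we are looking for.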

answered Sep 28 '22 18:09

Willem Van Onsem

This is a high-level overview of what happens when you want to find a value in a set (or a key in a dict). A hash table is a sparsely populated array whose cells are called buckets or bins.

Good hashing algorithms aim to minimize the chance of hash collisions such that in the average case foo in my_set has time complexity O(1). Performing a linear scan (foo in my_list) over a sequence has time complexity O(n). On the other hand foo in my_set has complexity O(n) only in the worst case with many hash collisions.

A small demonstration (with timings done in IPython, copied from another answer of mine):

>>> class stupidlist(list):
...:    def __hash__(self):
...:        return 1
...: 
>>> lists_list = [[i]  for i in range(1000)]
>>> stupidlists_set = {stupidlist([i]) for i in range(1000)}
>>> tuples_set = {(i,) for i in range(1000)}
>>> l = [999]
>>> s = stupidlist([999])
>>> t = (999,)
>>> 
>>> %timeit l in lists_list
25.5 µs ± 442 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit s in stupidlists_set
38.5 µs ± 61.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit t in tuples_set
77.6 ns ± 1.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

As you can see, the membership test in our stupidlists_set is even slower than a linear scan over the whole lists_list, while the set without loads of hash collisions gives the expected very fast lookup (roughly a factor of 500 faster than stupidlists_set).
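
For readers without IPython, roughly the same experiment can be run as a plain script with the standard timeit module. This is a sketch, not a rigorous benchmark: the class is renamed StupidList here, the repeat count of 1000 is an arbitrary choice, and the absolute numbers will differ from machine to machine.

```python
import timeit


class StupidList(list):
    """A list subclass whose instances all share one hash (worst case)."""

    def __hash__(self):
        return 1


lists_list = [[i] for i in range(1000)]
stupidlists_set = {StupidList([i]) for i in range(1000)}
tuples_set = {(i,) for i in range(1000)}

# label -> (value to look up, container to search)
candidates = {
    "list scan": ([999], lists_list),
    "colliding set": (StupidList([999]), stupidlists_set),
    "tuple set": ((999,), tuples_set),
}

repeats = 1000  # arbitrary; increase for more stable numbers
timings = {}
for label, (needle, haystack) in candidates.items():
    # Total seconds for `repeats` membership tests.
    timings[label] = timeit.timeit(lambda: needle in haystack, number=repeats)

for label, seconds in timings.items():
    print(f"{label}: {seconds / repeats * 1e6:.2f} µs per lookup")
```

The lambda adds a small constant overhead to every measurement, so the gap between the well-hashed set and the other two containers may look somewhat smaller than in the %timeit output above.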

answered Sep 28 '22 19:09

timgeb
