
Handling hash collisions in Python dictionaries

I have a bunch of dictionaries in Python, each containing user information, e.g.:

NewUserDict = {'name': 'John', 'age': 27}

I collect all these user info dictionaries within a larger dictionary container, using the hash value of each dictionary as the key (Hashing a dictionary?).

What would be the best way to handle hash collisions when adding new unique users to the dictionary? I was going to manually compare the dictionaries with colliding hash values, and just add some random number to the more recent hash value, e.g.:

if new_hash in larger_dictionary:
    if larger_dictionary[new_hash] != NewUserDict:
        new_hash = new_hash + somerandomnumber

What would be the standard way of handling this? Alternatively, how do I know whether I should be worrying about collisions in the first place?

asked Feb 06 '23 by wenhoo
1 Answer

Generally, you would key on the most unique element of your user record. That usually means the system has a username or a unique ID per record (user) that is guaranteed to be unique. The username or ID would be the unique key for the record. Since the system itself enforces this, for example by means of an auto-increment key in a database table, you can be sure there are no collisions.

THAT unique key therefore should be the key in your map to allow you to find a user record.
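In Python that simply means keying the outer dict by the ID itself. A minimal sketch (the counter and helper name are illustrative, standing in for a database's auto-increment key):

```python
# Sketch: key the container by a system-enforced unique ID rather than
# by a hash of the record. next_id simulates an auto-increment counter.
users = {}
next_id = 1

def add_user(record):
    """Store the record under a fresh unique ID and return that ID."""
    global next_id
    uid = next_id
    users[uid] = record
    next_id += 1
    return uid

uid = add_user({'name': 'John', 'age': 27})
print(users[uid]['name'])  # John
```

Because every record gets its own ID, two users with identical fields never clash, and lookup is a single dict access.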

However, if for some reason you don't have access to such a guaranteed-to-be-unique key, you can certainly create a hash from the record (as you describe) and use any of a number of hash table algorithms to store elements with possibly colliding keys. In that case, you don't avoid the collision; you simply deal with it.

A quick and commonly used algorithm for that goes like this: Use a hash over the record to create a key, as you already do. This key may potentially not be unique. Now store a list of records at the location indicated by the key. We call those lists 'buckets'. To store a new element, hash it and then append it to the list stored at that location (add it to the bucket). To find an element, hash it, find the entry, then sequentially search through the list/bucket at that location to find the entry you want.

Here's an example:

mymap[123] = [ {'name':'John','age':27}, {'name':'Bob','age':19} ]
mymap[678] = [ {'name':'Frank','age':29} ]

In the example, you have your hash table (implemented via a dict). You have hash key value 678, for which one entry is stored in the bucket. Then you have hash key value 123, where there is a collision: both the 'John' and 'Bob' entries have this hash value. No matter: you find the bucket stored at mymap[123] and iterate over it to find the entry you want.
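The bucket scheme above can be sketched in a few lines. Note that plain dicts aren't hashable, so this sketch hashes a frozenset of the record's items (which assumes the values themselves are hashable); the helper names (record_hash, add_record, find_record) are mine, not from the original post:

```python
# A minimal sketch of the bucket ("separate chaining") approach:
# each hash key maps to a list of records that share that hash.

def record_hash(record):
    # Plain dicts aren't hashable; hash a frozenset of their items instead.
    return hash(frozenset(record.items()))

def add_record(table, record):
    key = record_hash(record)
    bucket = table.setdefault(key, [])  # create the bucket on first use
    if record not in bucket:            # skip exact duplicates
        bucket.append(record)

def find_record(table, record):
    # Hash the record, then scan the (usually tiny) bucket linearly.
    for candidate in table.get(record_hash(record), []):
        if candidate == record:
            return candidate
    return None

table = {}
add_record(table, {'name': 'John', 'age': 27})
add_record(table, {'name': 'Bob', 'age': 19})

print(find_record(table, {'name': 'Bob', 'age': 19}))   # the Bob record
print(find_record(table, {'name': 'Frank', 'age': 29})) # None
```

Even if two different records hash to the same key, both simply land in the same bucket and the equality check inside find_record tells them apart.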

This is a flexible and very common implementation of hash maps, which doesn't require re-allocation or other complications. It's described in many places, for example here: https://www.cs.auckland.ac.nz/~jmor159/PLDS210/hash_tables.html (in chapter 8.3.1).

Performance generally only becomes an issue when you have lots of collisions (when the lists in each bucket get really long), something you'll avoid with a good hash function.

However: A true unique ID for your record, enforced by the database for example, is probably still the preferred approach.

answered Feb 07 '23 by jbrendel