Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to maintain dictionary in a heap in python?

Tags:

I have a dictionary as follows:

{'abc':100,'xyz':200,'def':250 .............}

It is a dictionary with keys as a name of a entity and the value is count of that entity. I need to return the top 10 elements of from the dictionary.

I can write a heap to do it, but I'm not sure how to do value to key mapping as certain values will be equal.

Is there any other data structure to do this?

like image 995
gizgok Avatar asked Feb 10 '13 06:02

gizgok


People also ask

How do you maintain a dictionary in Python?

From Python 3.6 onwards, the standard dict type maintains insertion order by default. will result in a dictionary with the keys in the order listed in the source code.

How do you initialize a heap in Python?

To add a new element to the heap, you append it to the end of the array and then call swim repeatedly until the new element finds its place in the heap. To delete the root, you swap it with the last element in the array, delete it and then call sink until the swapped element finds its place.

How dictionaries are stored in memory?

The dictionary is stored in memory in the form of a hash table with an ordered array of ranges and their corresponding values. The dictionary key has the UInt64 type. This storage method works the same way as hashed and allows using date/time (arbitrary numeric type) ranges in addition to the key.

How does Python define Heapq?

Heap queue is a special tree structure in which each parent node is less than or equal to its child node. In python it is implemented using the heapq module. It is very useful is implementing priority queues where the queue item with higher weight is given more priority in processing.


1 Answers

Using heapq you probably want to do something like this:

heap = [(-value, key) for key,value in the_dict.items()]
largest = heapq.nsmallest(10, heap)
largest = [(key, -value) for value, key in largest]

Note that since heapq implements only a min heap it's better to invert the values, so that bigger values become smaller.

This solution will be slower for small sizes of the heap, for example:

>>> import random
>>> import itertools as it
>>> def key_generator():
...     characters = [chr(random.randint(65, 90)) for x in range(100)]
...     for i in it.count():
...             yield ''.join(random.sample(characters, 3))
... 
>>> the_dict = dict((key, random.randint(-500, 500)) for key, _ in zip(key_generator(), range(3000)))
>>> def with_heapq(the_dict):
...     items = [(-value, key) for key, value in the_dict.items()]
...     smallest = heapq.nsmallest(10, items)
...     return [-value for value, key in smallest]
... 
>>> def with_sorted(the_dict):
...     return sorted(the_dict.items(), key=(lambda x: x[1]), reverse=True)[:10]
... 
>>> import timeit
>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
0.9220538139343262
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
1.2792410850524902

With 3000 values it's just slightly faster than the sorted version, which is O(nlogn) instead of O(n + mlogn). If we increase the size of the dict to 10000 the heapq version becomes even faster:

>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
2.436316967010498
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
3.585728168487549

The timings probably depends also on the machine on which you are running. You should probably profile which solution works best in your case. If the efficiency is not critical I'd suggest to use the sorted version because it's simpler.

like image 155
Bakuriu Avatar answered Oct 16 '22 12:10

Bakuriu