Reading Guido's infamous answer to the question Sorting a million 32-bit integers in 2MB of RAM using Python, I discovered the module heapq.
I also discover I didn't understand jack about it, nor did I know what I could do with it.
Can you explain to me (with the proverbial 6 years old target) what is the heap queue algorithm for and what you can do with it ?
Can you provide a simple Python snippet where using it (with the heapq
module) solves a problem that will be better solved with it and not with something else ?
A heap is a binary tree data structure (see BinaryTrees) in which each element has a key (or sometimes priority) that is less than the keys of its children. Heaps are used to implement the priority queue abstract data type (see AbstractDataTypes), which we'll talk about first.
The priority queue is the queue data structure and the heap is the tree data structure that operates and organizes data. The priority queue is based on a queue data structure working as a queue with a priority function. The heap is a tree data structure uses for sorting data in a specific order using an algorithm.
Taking an example from real life —Heap Sorting can be applied to a sim card store where there are many customers in line. The customers who have to pay bills can be dealt with first because their work will take less time. This method will save time for many customers in line and avoid unnecessary waiting.
heapq
implements binary heaps, which are a partially sorted data structure. In particular, they have three interesting operations:
heapify
turns a list into a heap, in-place, in O(n) time;heappush
adds an element to the heap in O(lg n) time;heappop
retrieves the smallest element off the heap in O(lg n) time.Many interesting algorithms rely on heaps for performance. The simplest one is probably partial sorting: getting the k smallest (or largest) elements of a list without sorting the entire list. heapq.nsmallest
(nlargest
) does that. The implementation of nlargest
can be paraphrased as:
def nlargest(n, l):
# make a heap of the first n elements
heap = l[:n]
heapify(heap)
# loop over the other len(l)-n elements of l
for i in xrange(n, len(l)):
# push the current element onto the heap, so its size becomes n+1
heappush(heap, l[i])
# pop the smallest element off, so that the heap will contain
# the largest n elements of l seen so far
heappop(heap)
return sorted(heap, reverse=True)
Analysis: let N be the number of elements in l
. heapify
is run once, for a cost of O(n); that's negligible. Then, in a loop running N-n = O(N) times, we perform a heappop
and a heappush
at O(lg n) cost each, giving a total running time of O(N lg n). When N >> n, this is a big win compared to the other obvious algorithm, sorted(l)[:n]
, which takes O(N lg N) time.
For example: you have a set of 1000 floating-point number. You want to repeatedly remove the smallest item from the set and replace it with a random number between 0 and 1. The fastest way to do it is with the heapq module:
heap = [0.0] * 1000
# heapify(heap) # usually you need this, but not if the list is initially sorted
while True:
x = heappop(heap)
heappush(head, random.random())
This takes a time per iteration that is logarithmic in the length of the heap (i.e. around 7 units, for a list of length 1000). Other solutions take a linear time (i.e. around 1000 units, which is 140 times slower, and gets slower and slower when the length increases):
lst = [0.0] * 1000
while True:
x = min(lst) # linear
lst.remove(x) # linear
lst.append(random.random())
or:
lst = [0.0] * 1000
while True:
x = lst.pop() # get the largest one in this example
lst.append(random.random())
lst.sort() # linear (in this case)
or even:
lst = [0.0] * 1000
while True:
x = lst.pop() # get the largest one in this example
bisect.insort(lst, random.random()) # linear
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With