there's heapq in python, for general usage. i want recording topN(0~20) for 10e7 records.
if use heapq, should use '-' to translate max to min; and recording a min number of bottom, to call heapq.heappushpop()
should i use heapq or self implement a heap(maybe buggy or less efficient)?
#update
import heapq
class TopN(object):
"""
v format: (num, value)
after looking into http://hg.python.org/cpython/file/2.7/Lib/heapq.py,
i find heappushpop already optimize, no need bottom value
feed() can be optimize further, if needed:
using func object instead of compare len(self.h) each time
"""
def __init__(self, N):
self.N = N
self.h = []
def feed(self, v):
if len(self.h) < self.N:
heapq.heappush(self.h, v)
else:
heapq.heappushpop(self.h, v)
def result(self):
self.h.sort(reverse=True)
return self.h
def t_topn():
topn = TopN(10)
for i in xrange(5):
topn.feed((i, str(i)))
res = topn.result()
assert sorted(res, reverse=True) == res
def t_topn_random():
import random
topn = TopN(10)
for i in xrange(100):
x = random.randint(0, 1e4)
topn.feed((x, str(x)))
res = topn.result()
assert sorted(res, reverse=True) == res
if __name__ == '__main__':
t_topn()
t_topn_random()
The only problem with heapq
is that it doesn't provide a key
function like everything else in the stdlib does. (If you're curious why, Raymond Hettinger explains in this email. He's right that heapq
couldn't provide the same interface as other sort functions—but the reasons don't affect your use case, where key
would just be lambda x: -x
.)
The usual workaround is to decorate-heap-undecorate. That is, put a modified version of your values into the heap that sorts by key
. Normally, this means one of the following:
key(x)
instead of x
, and then accessing unkey(value)
instead of value
(assuming key
is reversible).(key(x), x)
instead of x
, and then accessing value[1]
. (This can break stability, but heapq
doesn't promise stability anyway.)__le__
method, then storing Wrapper(x)
instead of x
and accessing value.value
instead of value
.In your case, the key function is reversible. So, just store -x
, and access -value
. That's about as trivial as decoration gets.
Still, regardless of how simple it is, you should probably write a wrapper, or you will screw it up at some point. For example, you could write a maxheap
that wraps the minheap in heapq
like this:
import heapq
def heapify(x):
for i in range(len(x)):
x[i] = -x[i]
heapq.heapify(x)
def heappush(heap, item):
heapq.heappush(heap, -item)
def heappop(heap):
return -heapq.heappop(heap)
… and so on for any other functions you need. It may be a bit of a pain, but it's a lot less work than implementing the whole thing from scratch.
While you're at it, you may want to wrap the heap in an object-oriented API so you can do heap.push(x)
instead of heapq.heappush(heap, x)
, etc.
import heapq
class MaxHeap(object):
def __init__(self, x):
self.heap = [-e for e in x]
heapq.heapify(self.heap)
def push(self, value):
heapq.heappush(self.heap, -value)
def pop(self):
return -heapq.heappop(self.heap)
…
If you take a quick look around the recipes at ActiveState or the modules on PyPI, you should find that others have already done most of the work for you.
Alternatively, you could copy and paste the heapq
source (it's pure Python) as maxheapq.py
and just replace the cmp_lt
function with its opposite. (Of course if you're doing that, it's probably just as easy, and certainly a lot clearer, to modify cmp_lt
to take a key
argument in the first place, and modify all the other functions to pass the key
through—bearing in mind that it won't be as generally applicable anymore, because it can't make the usual guarantee that key
is only called once.)
If you really want to live dangerously (you shouldn't), you could even monkeypatch it:
import heapq
def cmp_gt(x, y):
return y < x if hasattr(y, '__lt__') else not (x <= y)
heapq.cmp_lt = cmp_gt
But you don't want to do that in real code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With