Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python topN max heap, use heapq or self implement?

Tags:

python

heap

there's heapq in python, for general usage. i want recording topN(0~20) for 10e7 records.

if use heapq, should use '-' to translate max to min; and recording a min number of bottom, to call heapq.heappushpop()

should i use heapq or self implement a heap(maybe buggy or less efficient)?

#update

import heapq
class TopN(object):
    """
    v format: (num, value)

    after looking into http://hg.python.org/cpython/file/2.7/Lib/heapq.py, 
    i find heappushpop already optimize, no need bottom value

    feed() can be optimize further, if needed:
        using func object instead of compare len(self.h) each time
    """
    def __init__(self, N):
        self.N = N
        self.h = []        

    def feed(self, v):  
        if len(self.h) < self.N:
            heapq.heappush(self.h, v)
        else:
            heapq.heappushpop(self.h, v)

    def result(self):
        self.h.sort(reverse=True)
        return self.h

def t_topn():
    topn = TopN(10)
    for i in xrange(5):
        topn.feed((i, str(i)))
    res = topn.result()    
    assert sorted(res, reverse=True) == res 

def t_topn_random():
    import random
    topn = TopN(10)
    for i in xrange(100):
        x = random.randint(0, 1e4)
        topn.feed((x, str(x)))
    res = topn.result()    
    assert sorted(res, reverse=True) == res 

if __name__ == '__main__':
    t_topn()
    t_topn_random()
like image 255
whi Avatar asked Jan 07 '13 03:01

whi


1 Answers

The only problem with heapq is that it doesn't provide a key function like everything else in the stdlib does. (If you're curious why, Raymond Hettinger explains in this email. He's right that heapq couldn't provide the same interface as other sort functions—but the reasons don't affect your use case, where key would just be lambda x: -x.)

The usual workaround is to decorate-heap-undecorate. That is, put a modified version of your values into the heap that sorts by key. Normally, this means one of the following:

  • Storing key(x) instead of x, and then accessing unkey(value) instead of value (assuming key is reversible).
  • Storing (key(x), x) instead of x, and then accessing value[1]. (This can break stability, but heapq doesn't promise stability anyway.)
  • Writing a wrapper class that implements a custom __le__ method, then storing Wrapper(x) instead of x and accessing value.value instead of value.

In your case, the key function is reversible. So, just store -x, and access -value. That's about as trivial as decoration gets.

Still, regardless of how simple it is, you should probably write a wrapper, or you will screw it up at some point. For example, you could write a maxheap that wraps the minheap in heapq like this:

import heapq
def heapify(x):
    for i in range(len(x)):
        x[i] = -x[i]
    heapq.heapify(x)
def heappush(heap, item):
    heapq.heappush(heap, -item)
def heappop(heap):
    return -heapq.heappop(heap)

… and so on for any other functions you need. It may be a bit of a pain, but it's a lot less work than implementing the whole thing from scratch.

While you're at it, you may want to wrap the heap in an object-oriented API so you can do heap.push(x) instead of heapq.heappush(heap, x), etc.

import heapq
class MaxHeap(object):
    def __init__(self, x):
        self.heap = [-e for e in x]
        heapq.heapify(self.heap)
    def push(self, value):
        heapq.heappush(self.heap, -value)
    def pop(self):
        return -heapq.heappop(self.heap)

If you take a quick look around the recipes at ActiveState or the modules on PyPI, you should find that others have already done most of the work for you.

Alternatively, you could copy and paste the heapq source (it's pure Python) as maxheapq.py and just replace the cmp_lt function with its opposite. (Of course if you're doing that, it's probably just as easy, and certainly a lot clearer, to modify cmp_lt to take a key argument in the first place, and modify all the other functions to pass the key through—bearing in mind that it won't be as generally applicable anymore, because it can't make the usual guarantee that key is only called once.)

If you really want to live dangerously (you shouldn't), you could even monkeypatch it:

import heapq
def cmp_gt(x, y):
    return y < x if hasattr(y, '__lt__') else not (x <= y)
heapq.cmp_lt = cmp_gt

But you don't want to do that in real code.

like image 117
abarnert Avatar answered Sep 23 '22 23:09

abarnert