Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LFU cache implementation in python

I have implemented LFU cache in python with the help of Priority Queue Implementation given at https://docs.python.org/2/library/heapq.html#priority-queue-implementation-notes

I have given code in the end of the post.

But I feel that code has some serious problems:
1. To give a scenario, suppose there is only one page is continuously getting visited (say 50 times). But this code will always mark the already added node as "removed" and add it to heap again. So basically it will have 50 different nodes for the same page. Hence increasing heap size enormously.
2. This question is almost similar to Q1 of Telephonic Interview of http://www.geeksforgeeks.org/flipkart-interview-set-2-sde-2/ And the person mentioned that doubly linked list can give better efficiency as compared to heap. Can anyone explain me, how?

from llist import dllist
import sys
from heapq import heappush, heappop

class LFUCache:
    heap = []
    cache_map = {}
    REMOVED = "<removed-task>"

    def __init__(self, cache_size):
        self.cache_size = cache_size

    def get_page_content(self, page_no):
        if self.cache_map.has_key(page_no):
            self.update_frequency_of_page_in_cache(page_no)  
        else:
            self.add_page_in_cache(page_no)

        return self.cache_map[page_no][2]       

    def add_page_in_cache(self, page_no):
        if (len(self.cache_map) == self.cache_size):
            self.delete_page_from_cache() 

        heap_node = [1, page_no, "content of page " + str(page_no)]
        heappush(self.heap, heap_node)
        self.cache_map[page_no] = heap_node

    def delete_page_from_cache(self):
        while self.heap:
            count, page_no, page_content = heappop(self.heap)
            if page_content is not self.REMOVED:
                del self.cache_map[page_no]
                return

    def update_frequency_of_page_in_cache(self, page_no): 
        heap_node = self.cache_map[page_no]
        heap_node[2] = self.REMOVED
        count = heap_node[0]

        heap_node = [count+1, page_no, "content of page " + str(page_no)]
        heappush(self.heap, heap_node)
        self.cache_map[page_no] = heap_node

def main():
    cache_size = int(raw_input("Enter cache size "))
    cache = LFUCache(cache_size)

    while 1:
        page_no = int(raw_input("Enter page no needed "))
        print cache.get_page_content(page_no)
        print cache.heap, cache.cache_map, "\n"


if __name__ == "__main__":
    main() 
like image 227
Shweta Avatar asked Jun 05 '26 11:06

Shweta


1 Answers

Efficiency is a tricky thing. In real-world applications, it's often a good idea to use the simplest and easiest algorithm, and only start to optimize when that's measurably slow. And then you optimize by doing profiling to figure out where the code is slow.

If you are using CPython, it gets especially tricky, as even an inefficient algorithm implemented in C can beat an efficient algorithm implemented in Python due to the large constant factors; e.g. a double-linked list implemented in Python tends to be a lot slower than simply using the normal Python list, even for cases where in theory it should be faster.

Simple algorithm:

For an LFU, the simplest algorithm is to use a dictionary that maps keys to (item, frequency) objects, and update the frequency on each access. This makes access very fast (O(1)), but pruning the cache is slower as you need to sort by frequency to cut off the least-used elements. For certain usage characteristics, this is actually faster than other "smarter" solutions, though.

You can optimize for this pattern by not simply pruning your LFU cache to the maximum length, but to prune it to, say, 50% of the maximum length when it grows too large. That means your prune operation is called infrequently, so it can be inefficient compared to the read operation.

Using a heap:

In (1), you used a heap because that's an efficient way of storing a priority queue. But you are not implementing a priority queue. The resulting algorithm is optimized for pruning, but not access: You can easily find the n smallest elements, but it's not quite as obvious how to update the priority of an existing element. In theory, you'd have to rebalance the heap after every access, which is highly inefficient.

To avoid that, you added a trick by keeping elements around even if they are deleted. But this trades in space for time.

If you don't want to trade in time, you could update the frequencies in-place, and simply rebalance the heap before pruning the cache. You regain fast access times at the expense of slower pruning time, like the simple algorithm above. (I doubt there is any speed difference between the two, but I have not measured this.)

Using a double-linked list:

The double-linked list mentioned in (2) takes advantage of the nature of the possible changes here: An element is either added as the lowest priority (0 accesses), or an existing element's priority is incremented exactly by 1. You can use these attributes to your advantage if you design your data structures like this:

You have a double-linked list of elements which is ordered by the frequency of the elements. In addition, you have a dictionary that maps items to elements within that list.

Accessing an element then means:

  • Either it's not in the dictionary, that is, it's a new item, in which case you can simply append it to the end of the double-linked list (O(1))
  • or it's in the dictionary, in which case you increment the frequency in the element and move it leftwards through the double-linked list until the list is ordered again (O(n) worst-case, but usually closer to O(1)).

To prune the cache, you simply cut off n elements from the end of the list (O(n)).

like image 194
Jorgen Schäfer Avatar answered Jun 07 '26 01:06

Jorgen Schäfer



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!