Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Top K Frequent Words using heaps in Python [duplicate]

Tags:

python

heap

I'm trying to solve the Top K Frequent Words Leetcode problem in O(N log K) time and am getting an undesirable result. My Python3 code and console output are below:

from collections import Counter
import heapq

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        
        counts = Counter(words)
        print('Word counts:', counts)
        
        result = []
        for word in counts:
            print('Word being added:', word)
            if len(result) < k:
                heapq.heappush(result, (-counts[word], word))
                print(result)
            else:
                heapq.heappushpop(result, (-counts[word], word))
        result = [r[1] for r in result]
        
        return result

----------- Console output -----------

Word counts: Counter({'the': 3, 'is': 3, 'sunny': 2, 'day': 1})
Word being added: the
[(-3, 'the')]
Word being added: day
[(-3, 'the'), (-1, 'day')]
Word being added: is
[(-3, 'is'), (-1, 'day'), (-3, 'the')]
Word being added: sunny
[(-3, 'is'), (-2, 'sunny'), (-3, 'the'), (-1, 'day')]

When I run the test case ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"] with K = 4, I find that the word the gets shifted to the end of the list (after day) once is is added even though they both have a count of 3. This makes sense since the parent need only be <= the children and the children are not ordered in any way. Since (-2, 'sunny') and (-3, 'the') are both > (-3, 'is'), the heap invariant is, in fact, maintained even though (-3, 'the') < (-2, 'sunny') and is the right child of (-3, 'is'). The expected result is ["is","the","sunny","day"] while the output of my code is ["is","sunny","the","day"].

Should I be using heaps to solve this problem in O(N log K) time, and if so, how can I modify my code to achieve the desired result?

like image 953
Vivek Subramanian Avatar asked Nov 11 '20 00:11

Vivek Subramanian


2 Answers

You're on the right track with using heapq and Counter you just need to make a slight modification in how you are using them in relation to k: (you need to iterate the whole of counts before adding anything to result):

from collections import Counter
import heapq

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        counts = collections.Counter(words)
        max_heap = []
        for key, val in counts.items():
            heapq.heappush(max_heap, (-val, key))
        
        result = []
        while k > 0:
            result.append(heapq.heappop(max_heap)[1])
            k -= 1
        
        return result

Did not read the requirement of being O(N log k) before, here is a modification to the above solution to achieve that:

from collections import Counter, deque
import heapq

class WordWithFrequency(object):
    def __init__(self, word, frequency):
        self.word = word
        self.frequency = frequency

    def __lt__(self, other):
        if self.frequency == other.frequency:
            return lt(other.word, self.word)
        else:
            return lt(self.frequency, other.frequency)

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:    
        counts = collections.Counter(words)
        
        max_heap = []
        for key, val in counts.items():
            heapq.heappush(max_heap, WordWithFrequency(key, val))
            if len(max_heap) > k:
                heapq.heappop(max_heap)
        
        result = deque([]) # can also use a list and just reverse at the end
        while k > 0:
            result.appendleft(heapq.heappop(max_heap).word)
            k -= 1
        
        return list(result)
like image 92
Sash Sinha Avatar answered Oct 16 '22 14:10

Sash Sinha


You don't need to bother with a heap. Counter() already has a method to return the most common elements.

>>> c = Counter(["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"])
>>> c.most_common(4)
[('the', 3), ('is', 3), ('sunny', 2), ('day', 1)]
like image 29
Frank Yellin Avatar answered Oct 16 '22 12:10

Frank Yellin