 

Design a system to keep the top k frequent words in real time

Tags:

algorithm

Suppose we want a system that keeps the top k frequent words appearing in tweets over the last hour. How would we design it?

I can come up with a hashmap, a heap, a log, or MapReduce, but I cannot find a very efficient way to do this.

Actually, it's a question from an interview.
First I used a hashmap to count the frequency of each word. I also kept a log so that, as time passed, I could decrement the counts of the oldest words.
Then I kept an entry array of length K (the top-K array) and a number N, which is the smallest count in that array.
Every time a new word arrives, I update the counting hashmap and get that word's count. If the count is larger than N, I check whether the word is already in the array. If it is, I update that entry; if not, I delete the smallest entry in the array and insert this word instead (updating N accordingly).

Here is the problem: my approach cannot deal with deletions. When a count drops, I may need to iterate over the entire counting hashmap to find the new top K.
Also, as the interviewer said, the system should return the result very fast. I thought of several machines working together, with each machine taking some of the words, but how to combine the results becomes a problem too.
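
To make this concrete, here is a rough Python sketch of my approach (all the names are just illustrative). It leaves out the top-K array bookkeeping and simply rescans the counter, which is exactly the fallback I end up with once deletions happen:

    import heapq
    import time
    from collections import deque

    WINDOW_SECONDS = 3600

    class NaiveTopK:
        """Counting hashmap + log of arrivals; top K recomputed by scanning."""

        def __init__(self, k):
            self.k = k
            self.counts = {}      # word -> count inside the last hour
            self.log = deque()    # (arrival time, word), oldest first

        def add(self, word, now=None):
            now = time.time() if now is None else now
            self.log.append((now, word))
            self.counts[word] = self.counts.get(word, 0) + 1
            self._expire(now)

        def _expire(self, now):
            # Count down the oldest words as they fall out of the window.
            while self.log and now - self.log[0][0] > WINDOW_SECONDS:
                _, old = self.log.popleft()
                self.counts[old] -= 1
                if self.counts[old] == 0:
                    del self.counts[old]

        def top_k(self):
            # O(N log k) scan over all unique words -- the step that is too slow.
            return heapq.nlargest(self.k, self.counts.items(),
                                  key=lambda item: item[1])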

Lampard asked Feb 11 '14 at 03:02

2 Answers

If the words are not weighted (other than weights 0 and 1), then it is possible to derive a simple data structure which maintains the word counts in order, using O(N) auxiliary storage where N is the number of unique words encountered in the sliding window (one hour, in the example). All operations (add a word, expire a word, look up the most frequent word) can be performed in O(1) time. Since any accurate solution needs to retain all the unique words in the sliding window, this solution is not asymptotically worse, although the constant factor per word is not small.

The key to the solution is that the count for any given word can only be incremented or decremented by 1, and that all of the counts are integers. Consequently, it is possible to maintain a doubly-linked list of counts (in order) where each node in the list points to a doubly-linked list of words which have that count. In addition, each node in the word-list points back to the appropriate count node. Finally, we maintain a hashmap which allows us to find the node corresponding to a given word.

Finally, in order to decay the words at the end of their life, we need to retain the entire data stream from the sliding window, which has a size of O(N') where N' is the total number of words encountered during the sliding window. This can be stored as a singly-linked list where each node has a timestamp and a pointer to the unique word in the word-list.

When a word is encountered or expired, its count needs to be adjusted. Since the count can only be incremented or decremented by 1, the adjustment always consists in moving the word to the adjacent count-node (which may or may not exist); since the count-nodes are stored in a sorted linked list, the adjacent node can be found or created in time O(1). Furthermore, the most popular words (and counts) can always be traced in constant time by traversing the count list backwards from the maximum.

In case that was not obvious, here is a rough ASCII-art drawing of the data structure at a given point in time:

Count list      word lists (each node points back to the count node)

  17            a <--> the <--> for
  ^
  |
  v
  12            Wilbur <--> drawing
  ^
  |
  v
  11            feature

Now, suppose we find a Wilbur. That will raise its count to 13; we can see from the fact that the successor of 12 is not 13 that the 13 count node needs to be created and inserted into the count-list. After we do that, we remove Wilbur from its current word-list, put it into the newly-created empty word-list associated with the new count node, and change the count-pointer in Wilbur to point to the new count node.

Then, suppose that a use of drawing expires, so its new count will be 11. We can see from the fact that the predecessor of 12 is 11 that no new count node needs to be created; we simply remove drawing from its word-list and attach it to the word-list associated with 11, fixing its count-pointer as we do so. Now we notice that the word-list associated with 12 is empty, so we can remove the 12 count node from the count-list and delete it.

When the count for a word reaches 0, rather than attaching it to the 0 count node, which doesn't exist, we just delete the word node. And if a new word is encountered, we just add the word to the 1 count node, creating that count node if it doesn't exist.

In the worst case, every word has a unique count, so the size of the count-list cannot be greater than the number of unique words. Also, the total size of the word-lists is exactly the number of unique words because every word is in exactly one word-list, and fully-expired words don't appear in the word-lists at all.
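
Here is a minimal Python sketch of this data structure. The class and method names are my own, and the expiry log stores the word itself rather than a pointer into the word-list, which comes to the same thing via the hashmap:

    import time
    from collections import deque

    WINDOW_SECONDS = 3600

    class CountNode:
        """One node of the sorted count-list; owns the words at that count."""
        __slots__ = ("count", "words", "prev", "next")

        def __init__(self, count):
            self.count = count
            self.words = set()
            self.prev = None
            self.next = None

    class SlidingTopK:
        def __init__(self):
            # Sentinels bracket the count-list; real nodes live between them.
            self.head = CountNode(0)
            self.tail = CountNode(float("inf"))
            self.head.next, self.tail.prev = self.tail, self.head
            self.node_of = {}    # hashmap: word -> count node currently holding it
            self.log = deque()   # (arrival time, word), oldest first

        def _link_after(self, node, new):
            new.prev, new.next = node, node.next
            node.next.prev = new
            node.next = new

        def _unlink(self, node):
            node.prev.next = node.next
            node.next.prev = node.prev

        def _bump(self, word, delta):
            # Move `word` to the adjacent count node (delta is +1 or -1).
            cur = self.node_of.get(word)             # None <=> count is currently 0
            base = cur if cur is not None else self.head
            new_count = base.count + delta

            if cur is not None:
                cur.words.discard(word)

            if new_count > 0:
                neighbour = base.next if delta > 0 else base.prev
                if neighbour.count == new_count:
                    target = neighbour                # adjacent count node exists...
                else:                                 # ...or must be created
                    target = CountNode(new_count)
                    self._link_after(base if delta > 0 else base.prev, target)
                target.words.add(word)
                self.node_of[word] = target
            else:
                self.node_of.pop(word, None)          # count reached 0: drop the word

            if cur is not None and not cur.words:
                self._unlink(cur)                     # delete an emptied count node

        def add(self, word, now=None):
            now = time.time() if now is None else now
            self.log.append((now, word))
            self._bump(word, +1)
            self._expire(now)

        def _expire(self, now):
            while self.log and now - self.log[0][0] > WINDOW_SECONDS:
                _, old = self.log.popleft()
                self._bump(old, -1)

        def top_k(self, k):
            # Walk the count-list backwards from the maximum.
            result, node = [], self.tail.prev
            while node is not self.head and len(result) < k:
                for word in node.words:
                    result.append((word, node.count))
                    if len(result) == k:
                        break
                node = node.prev
            return result

Each add or expire touches only the word's own count node and one neighbour, which is where the O(1) bound comes from.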

--- EDIT

This algorithm is a bit RAM-hungry but it really shouldn't have any trouble holding an hour's worth of tweets. Or even a day's worth. And the number of unique words is not going to change much after a few days, even considering abbreviations and misspellings. Even so, it's worth thinking about ways to reduce the memory footprint and/or make the algorithm parallel.

To reduce the memory footprint, the easiest thing is to just drop words which are still unique after a few minutes. This will dramatically cut down on the unique word count, without altering the counts of popular words. Indeed, you could prune a lot more drastically without altering the final result.

To run the algorithm in parallel, individual words can be allocated to different machines by using a hash function to generate a machine number. (Not the same hash function as the one used to construct the hash tables.) Then the top k words can be found by merging the top k words from each machine; the allocation by hash guarantees that the set of words from each machine is distinct.
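
A toy sketch of that partition-and-merge step (the function names are just illustrative):

    import hashlib
    import heapq

    def machine_for(word, num_machines):
        # Stable hash for routing (deliberately not the hash the tables use),
        # so every front-end sends a given word to the same machine.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_machines

    def merge_top_k(per_machine_results, k):
        # per_machine_results: one list of (word, count) pairs per machine.
        # Since each word lives on exactly one machine, the k best from each
        # machine are guaranteed to contain the global top k.
        merged = [pair for partial in per_machine_results for pair in partial]
        return heapq.nlargest(k, merged, key=lambda pair: pair[1])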

rici answered Sep 21 '22 at 13:09

This set of problems is addressed by data stream algorithms. In your particular case there are two that fit: "Lossy Counting" and "Sticky Sampling". This is the paper that explains them, or this one, with pictures; this is a more simplified introduction.

Edit (too long to fit into a comment):

Although these streaming algorithms do not discount expired data per se, one can run, for instance, 60 sliding windows, one for each minute of the hour, and then delete the oldest and create a new one every minute. The sliding window on top is used for querying, the others for updates only. This gives you a 1-minute resolution.
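
As a sketch of that rotation, with plain exact counters standing in for Lossy Counting / Sticky Sampling summaries (the bookkeeping is the same with a real streaming summary; the names are illustrative):

    import heapq
    from collections import Counter, deque

    class StaggeredWindows:
        """60 staggered summaries: the promoted one answers queries,
        the rest accumulate, and the set rotates once a minute."""

        def __init__(self, minutes=60):
            self.query_summary = Counter()                     # frozen, read-only
            self.accumulating = deque(Counter() for _ in range(minutes - 1))

        def add(self, word):
            for summary in self.accumulating:                  # updates only
                summary[word] += 1

        def rotate(self):
            # Called once a minute: promote the oldest summary (it has seen
            # roughly the last hour of data) and start a fresh one.
            self.query_summary = self.accumulating.popleft()
            self.accumulating.append(Counter())

        def top_k(self, k):
            # Queries read the promoted summary: ~1 hour of data, 1-minute resolution.
            return heapq.nlargest(k, self.query_summary.items(),
                                  key=lambda item: item[1])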

Critics say that streaming algorithms are probabilistic and will not give you an exact count. While this is true (compare, for instance, with rici's algorithm here), you do control the error and can make it very low if desired. As your stream grows, you would want to set the error as a percentage of the stream size rather than as an absolute value.

Streaming algorithms are very memory efficient, which is the most important thing when crunching large streams in real time. Compare this with rici's exact algorithm, which requires a single host to keep all the data for the current sliding window in memory. It might not scale well: increase the rate from 100/s to 100k/s, or the window size from 1 hour to 7 days, and you will run out of memory on a single host.
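
A back-of-envelope illustration, with assumed figures (100k word occurrences per second and roughly 50 bytes per expiry-log entry):

    occurrences = 100_000 * 86_400 * 7        # ~6e10 words logged over 7 days
    bytes_per_entry = 50                      # assumed: timestamp + pointer + overhead
    print(occurrences * bytes_per_entry / 1e12, "TB for the expiry log alone")  # ~3 TB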

Hash tables, which are an essential part of rici's algorithm, require one contiguous block of memory, which becomes more and more problematic as they grow.

Igor Katkov answered Sep 22 '22 at 13:09