I recently came across this interview question:
Given a continuous twitter feed, design an algorithm to return the 100 most
frequent words used at this minute, this hour and this day.
I was thinking of a system with a hash map of word -> count linked to three min-heaps for the current minute, hour and day.
Every incoming message is tokenized and sanitized, and the word counts are updated in the hash map (with an increase-key in each heap that already contains the word).
If a word is not in a heap (and the heap size == 100), check whether its frequency is greater than the heap's minimum value; if so, extract-min and insert the word into the heap.
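To make the idea concrete, here is a rough Python sketch of the bookkeeping I have in mind (the class name and `k` parameter are mine; note that Python's `heapq` has no increase-key, so this version just rebuilds the top 100 lazily from the hash map instead of mutating heap entries in place):

```python
import heapq
from collections import defaultdict


class TopKTracker:
    """Word counts for one window, plus an on-demand top-k query."""

    def __init__(self, k=100):
        self.k = k
        self.counts = defaultdict(int)

    def add(self, word):
        # Update the hash map of word -> count.
        self.counts[word] += 1

    def top_k(self):
        # Rebuild the candidates lazily: counts change under a live heap,
        # so a stale (count, word) heap entry is not safe to compare against.
        return heapq.nlargest(self.k, self.counts.items(), key=lambda kv: kv[1])
```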
Are there better ways of doing this?
Your algorithm is a good start, but it is not going to produce correct results. The problem is that the hash map as you describe it is a one-way street: once a word gets added, it stays counted forever.
You need an array of 1440 (24*60) word+count hash maps organized the way that you describe; these are your minute-by-minute counts. You need two additional hash maps - for the rolling total of the hour and of the day.
Define two operations on hash maps - add and subtract - with the semantics of merging counts of identical words, and removing words when their count drops to zero.
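Python's `collections.Counter`, for instance, already gives you exactly these semantics: `+` merges counts of identical words, and `-` discards any word whose count drops to zero:

```python
from collections import Counter

minute = Counter({"cat": 3, "dog": 1})
hour_total = Counter({"cat": 5, "bird": 2})

# add: counts of identical words are merged
hour_total = hour_total + minute
# hour_total is now Counter({'cat': 8, 'bird': 2, 'dog': 1})

# subtract: words whose count drops to zero are removed entirely
stale_minute = Counter({"cat": 8, "dog": 1})
hour_total = hour_total - stale_minute
# hour_total is now Counter({'bird': 2})
```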
Each minute you start a new hash map and update its counts from the feed. At the end of the minute, you place that hash map into the array for the current minute and add it to the rolling totals for the hour and for the day. You then subtract the hash map from an hour ago from the hourly running total, and the hash map from 24 hours ago from the daily running total.
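The end-of-minute bookkeeping can be sketched like this (using `Counter` for the hash maps; the window sizes are constructor parameters here only so the sketch is easy to exercise - in production they would be 60 and 1440):

```python
from collections import Counter, deque


class RollingTotals:
    """Minute-by-minute word counts plus rolling hour and day totals."""

    def __init__(self, hour_window=60, day_window=24 * 60):
        self.hour_window = hour_window
        self.minutes = deque(maxlen=day_window)  # finished minutes, newest last
        self.hour_total = Counter()
        self.day_total = Counter()

    def end_of_minute(self, minute_counts: Counter):
        # Subtract the minute that slides out of the hourly window.
        if len(self.minutes) >= self.hour_window:
            self.hour_total = self.hour_total - self.minutes[-self.hour_window]
        # Subtract the minute that slides out of the daily window
        # (the deque's maxlen will evict it on the append below).
        if len(self.minutes) == self.minutes.maxlen:
            self.day_total = self.day_total - self.minutes[0]
        self.minutes.append(minute_counts)
        self.hour_total = self.hour_total + minute_counts
        self.day_total = self.day_total + minute_counts
```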
Finally, you need a way to produce the top 100 words given a hash map. This should be a trivial task: add the items to an array of word+count entries, sort on the count, and keep the top 100.
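For example, in Python (a full sort works exactly as described; `heapq.nlargest` is a drop-in that only keeps 100 candidates, O(n log 100) instead of O(n log n)):

```python
import heapq
from collections import Counter


def top_100(counts: Counter, k: int = 100):
    """Return the k most frequent (word, count) pairs from a hash map."""
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])


counts = Counter("the quick brown fox the the quick".split())
# top_100(counts, 2) → [('the', 3), ('quick', 2)]
```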
dasblinkenlight made a good point about the mistake of never removing items from your hash map.
There is one more thing to add, though: to actually compute the top K words for a given minute/hour/day, it is faster to use partitioning (O(n)) than sorting (O(n log n)).
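A quickselect-style sketch of that partitioning approach (function name is mine; average-case O(n), and note that the k returned items come back unordered - sort just those 100 if display order matters):

```python
import random


def top_k_partition(counts: dict, k: int):
    """Top-k (word, count) pairs via quickselect partitioning, average O(n)."""
    items = list(counts.items())
    if k >= len(items):
        return items
    lo, hi = 0, len(items)
    while True:
        # Lomuto partition, descending by count, around a random pivot.
        p = random.randrange(lo, hi)
        items[p], items[hi - 1] = items[hi - 1], items[p]
        pivot = items[hi - 1][1]
        store = lo
        for i in range(lo, hi - 1):
            if items[i][1] > pivot:
                items[store], items[i] = items[i], items[store]
                store += 1
        items[store], items[hi - 1] = items[hi - 1], items[store]
        # Pivot now sits at its final rank `store`.
        if store == k - 1:
            return items[:k]          # first k entries are the top k
        if store < k - 1:
            lo = store + 1            # top-k boundary is to the right
        else:
            hi = store                # top-k boundary is to the left
```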
HTH.