 

Realtime tracking of top 100 twitter words per min/hour/day

Tags:

algorithm

nlp

I recently came across this interview question:

Given a continuous twitter feed, design an algorithm to return the 100 most
frequent words used at this minute, this hour and this day.

I was thinking of a system with a hash map of word -> count linked to 3 min-heaps for the current min, hour and day.

Every incoming message is tokenized and sanitized, and the word counts are updated in the hash map (with an increase-key in each heap where the word is already present).

If a word isn't in a heap (and the heap is already at size 100), check whether its frequency exceeds the heap's minimum; if so, extract-min and insert the new word.
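A minimal Python sketch of this idea (names are illustrative; since Python's heapq has no increase-key, heapq.nlargest, which maintains a size-k min-heap internally, stands in for the explicit extract-min/insert bookkeeping):

```python
from collections import Counter
import heapq

counts = Counter()  # word -> count hash map

def process_message(message):
    # Tokenize and sanitize: lowercase, keep alphabetic characters only.
    for word in message.lower().split():
        token = ''.join(ch for ch in word if ch.isalpha())
        if token:
            counts[token] += 1

def top_words(k=100):
    # nlargest keeps a size-k min-heap and evicts the minimum whenever a
    # more frequent word arrives, mirroring the step described above.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

process_message("the cat and the hat and the bat")
print(top_words(3))
```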

Are there better ways of doing this?

asked Apr 17 '12 by barefootshoes


2 Answers

Your algorithm is a good start, but it is not going to produce correct results. The problem is that the hash map as you describe it is a one-way street: once a word gets added, it stays counted forever.

You need an array of 1440 (24×60) word→count hash maps organized the way you describe; these are your minute-by-minute counts. You need two additional hash maps for the rolling totals of the hour and of the day.

Define two operations on hash maps, add and subtract, with the semantics of merging counts of identical words and removing words whose count drops to zero.

Each minute you start a new hash map and update its counts from the feed. At the end of the minute, place that hash map into the array slot for the current minute and add it to the rolling totals for the hour and the day; then subtract the hash map from an hour ago from the hourly total, and the hash map from 24 hours ago from the daily total.

Finally, you need a way to produce the top 100 words from a given hash map. This is a trivial task: dump the entries into an array of word+count pairs, sort by count, and keep the top 100.
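The rolling scheme above can be sketched in Python (class and method names are illustrative; a deque stands in for the 1440-slot array, and Counter provides the add/subtract semantics):

```python
from collections import Counter, deque
import heapq

class RollingTopWords:
    """Sliding-window word counts: one Counter per finished minute,
    plus rolling totals for the last hour and the last day."""

    def __init__(self):
        self.current = Counter()     # counts for the minute in progress
        self.minutes = deque()       # finished per-minute Counters (up to 1440)
        self.hour_total = Counter()  # rolling total of the last 60 minutes
        self.day_total = Counter()   # rolling total of the last 1440 minutes

    def add_word(self, word):
        self.current[word] += 1

    def end_minute(self):
        # "Add": merge the finished minute into both rolling totals.
        self.hour_total.update(self.current)
        self.day_total.update(self.current)
        self.minutes.append(self.current)
        self.current = Counter()
        # "Subtract": remove the minute that just fell out of each window.
        # `+= Counter()` drops zero-count entries, per the semantics above.
        if len(self.minutes) > 60:
            self.hour_total.subtract(self.minutes[-61])
            self.hour_total += Counter()
        if len(self.minutes) > 1440:
            self.day_total.subtract(self.minutes.popleft())
            self.day_total += Counter()

    def top(self, window, k=100):
        # Produce the top-k words for the 'hour' or 'day' window.
        table = self.hour_total if window == 'hour' else self.day_total
        return heapq.nlargest(k, table.items(), key=lambda kv: kv[1])
```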

answered Sep 21 '22 by Sergey Kalinichenko

dasblinkenlight made a good point about the mistake of never evicting items from your hash map.

There is one more thing to add, though: to compute the top K words for a given minute/hour/day, it is faster to use partitioning (O(n)) than sorting (O(n log n)):

  1. dump the hash map of a minute/hour/day's word counts into an array: O(n)
  2. use median-of-medians selection to find the K-th element: O(n)
  3. partition around the K-th element: O(n)
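A sketch of those three steps in Python (random-pivot quickselect, expected O(n), stands in here for median-of-medians, which guarantees worst-case O(n) but is much longer to write out):

```python
import random

def top_k_by_partition(counts, k):
    """Return the k most frequent (word, count) pairs, in no particular order."""
    items = list(counts.items())  # step 1: dump the hash map into an array
    if k >= len(items):
        return items
    lo, hi = 0, len(items) - 1
    target = k - 1                # index of the k-th largest once partitioned
    while True:
        # Steps 2-3: partition descending by count around a random pivot.
        pivot = items[random.randint(lo, hi)][1]
        i, j = lo, hi
        while i <= j:
            while items[i][1] > pivot:
                i += 1
            while items[j][1] < pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        if target <= j:
            hi = j            # k-th element is in the left partition
        elif target >= i:
            lo = i            # k-th element is in the right partition
        else:
            break             # target lies in the pivot-equal band: done
    return items[:k]
```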

HTH.

answered Sep 23 '22 by galactica