How does Twitter's trending topics algorithm decide which words to extract from tweets?

I saw this question, which focuses on the "Britney Spears" problem, but I have a slightly different question. How does the algorithm determine which words or phrases need to be ranked? For instance, if I send out a tweet that says "Michael Jackson died", how does it know to pull out "Michael Jackson" but not "died"?

Or suppose that Alec Baldwin and Steven Baldwin were in the news that day and thus were both mentioned in a lot of tweets. How would it know to treat both names differently instead of just pulling out "Baldwin"?

Done naively, I could see this problem as being NP-complete (you'd have to compare all potential phrases in the tweet with all potential phrases in everyone else's tweets).

Jason Baker asked Jan 03 '10 19:01


2 Answers

A general solution to this problem is with "term frequency, inverse document frequency" (tf-idf).

It is a statistical approach that scores a term higher when it appears often in a given document but rarely across the collection as a whole. In this case, the name "Michael Jackson" normally has a very low frequency compared to a common English word like "died", so it stands out as the more relevant term.
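To make the scoring concrete, here is a minimal tf-idf sketch. It is only an illustration under stated assumptions: Twitter's actual method is not public, the toy tweets are made up, and the `tf_idf` helper is a hypothetical name using plain whitespace tokenization and the basic `log(N / df)` form of idf.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            # Term frequency within this document times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


# Made-up corpus: three background tweets plus the tweet of interest.
tweets = [
    "my phone battery died again",
    "the battery in my car died",
    "my laptop died last night",
    "michael jackson died",
]
scores = tf_idf(tweets)[-1]
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

Because "died" occurs in every tweet of this toy corpus, its idf is `log(4 / 4) = 0`, while "michael" and "jackson" each appear in only one tweet and get a positive score.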

As for Alec Baldwin vs. Steven Baldwin: these would be identified as separate during part-of-speech tagging, since each full name would be tagged as its own proper noun.
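A rough sketch of the idea of keeping full names together follows. Note the loud assumption: instead of a real part-of-speech tagger, it uses capitalization as a crude stand-in for "proper noun", so it will mis-handle sentence-initial words and lowercase tweets. The `proper_noun_phrases` helper and the sample tweets are hypothetical.

```python
import re
from collections import Counter


def proper_noun_phrases(text):
    """Return maximal runs of capitalized words, e.g. 'Alec Baldwin'.

    Assumption: capitalization approximates a proper-noun tag. A real
    system would use a trained part-of-speech tagger instead.
    """
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)


# Made-up tweets mentioning both brothers.
tweets = [
    "watching Alec Baldwin on tv tonight",
    "did you see what Steven Baldwin said",
    "cannot believe Alec Baldwin did that",
]
counts = Counter(
    phrase for tweet in tweets for phrase in proper_noun_phrases(tweet)
)
print(counts.most_common())
```

Since each run of capitalized words is kept as one phrase, "Alec Baldwin" and "Steven Baldwin" are counted separately and the bare surname "Baldwin" never becomes a term of its own.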

James Kolpack answered Sep 28 '22 18:09


I believe it looks for common sets of words. Also, it appears that they are referencing http://www.whatthetrend.com/

In addition, there might be some human curation involved.
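The "common sets of words" idea can be sketched by counting word n-grams across tweets with a hash map. This is a guess at the general shape, not Twitter's implementation; `ngrams`, `count_phrases`, and the sample tweets are hypothetical. It makes a single pass over each tweet, so the work grows linearly with the total number of words rather than requiring every phrase to be compared against every other tweet.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word phrases in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_phrases(tweets, max_n=2):
    """Count in how many tweets each 1..max_n word phrase appears."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        for n in range(1, max_n + 1):
            # set() so a phrase counts at most once per tweet.
            counts.update(set(ngrams(tokens, n)))
    return counts


# Made-up tweets about the same event.
tweets = [
    "michael jackson died",
    "rip michael jackson",
    "michael jackson was a legend",
]
counts = count_phrases(tweets)
print(counts["michael jackson"], counts["died"])
```

Here the phrase "michael jackson" appears in all three tweets while "died" appears in only one, so the two-word phrase is the stronger candidate.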

Daniel A. White answered Sep 28 '22 19:09