I saw this question, which focuses on the "Britney Spears" problem. But I have a bit of a different question. How does the algorithm determine which words or phrases need to be ranked? For instance, if I send out a tweet that says "Michael Jackson died", how does it know to pull out "Michael Jackson" but not "died"?
Or suppose that Alec Baldwin and Steven Baldwin were in the news that day and thus were both mentioned in a lot of tweets. How would it know to treat both names differently instead of just pulling out "Baldwin"?
Done naively, I could see this problem as being NP-complete (you'd have to compare all potential phrases in the tweet with all potential phrases in everyone else's tweets).
A general solution to this problem is with "term frequency, inverse document frequency" (tf-idf).
It is a statistical approach that scores a term highly when it appears often in one document (here, a tweet) but rarely across the whole collection. In this case, the name "Michael Jackson" shows up in far fewer tweets overall than a common English word like "died", so it gets a much higher weight.
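To make the weighting concrete, here is a minimal sketch of tf-idf in plain Python. It assumes whitespace tokenization and the basic log(N / df) form of idf; the function name `tf_idf` and the sample tweets are made up for illustration, and this is not a claim about how Twitter actually computes trends.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            # Term frequency in this document times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Made-up sample data: "died" appears in every tweet, the name in only one.
tweets = [
    "michael jackson died",
    "my phone battery died",
    "the plant on my desk died",
]
scores = tf_idf(tweets)
```

In this toy corpus "died" occurs in every tweet, so its idf is log(3/3) = 0 and it drops out, while "michael" and "jackson" each keep a positive score.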
As for Alec Baldwin vs. Steven Baldwin: these would be identified as separate during part-of-speech tagging, since they would be tagged as individual proper nouns.
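As a rough illustration of the idea, the sketch below groups consecutive capitalized words into one phrase. Note that this is a crude regex heuristic standing in for a real part-of-speech tagger, not actual tagging, and the function name `proper_noun_phrases` is hypothetical. It will also pick up ordinary capitalized words at the start of a sentence, which a real tagger would handle better.

```python
import re

def proper_noun_phrases(text):
    """Return maximal runs of capitalized words (a stand-in for POS tagging)."""
    # A capitalized word followed by zero or more capitalized words,
    # so "Alec Baldwin" comes out as one phrase rather than two tokens.
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

names = proper_noun_phrases("Alec Baldwin and Steven Baldwin were in the news")
```

Because the lowercase "and" breaks the run, the two names come out as separate phrases instead of collapsing into a single "Baldwin".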
I believe it looks for common sets of words. Also, it appears that they are referencing http://www.whatthetrend.com/
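If "common sets of words" is read as word n-grams, a minimal sketch could count how many tweets each n-gram appears in. This is only a guess at what such a step might look like; the function name `common_phrases`, its parameters, and the sample tweets are assumptions for illustration.

```python
from collections import Counter

def common_phrases(tweets, n=2, top=3):
    """Return the `top` most common word n-grams, counted once per tweet."""
    counts = Counter()
    for tweet in tweets:
        words = tweet.lower().split()
        # Use a set so a phrase repeated inside one tweet is counted once.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return counts.most_common(top)

# Made-up sample data.
sample = [
    "michael jackson died",
    "rip michael jackson",
    "michael jackson was a legend",
]
top_phrase = common_phrases(sample, top=1)
```

Here "michael jackson" is the only bigram shared by all three tweets, so it ranks first.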
In addition to this, there might be a small amount of human control involved too.