How can I improve my algorithm for finding hot topics, like Twitter does?

Tags: php, cron

I have created a cron job for my website that runs every 2 hours. It counts the words in the feeds and then displays the 10 most frequent words as the hot topics.

This is similar to what Twitter does on its homepage, where it shows the most popular topics being discussed.

Right now, my cron job counts every word except those in a list of words I have excluded, such as:

// words that are negligible (duplicates removed)
$negligibleWords = array('of', 'a', 'an', 'also', 'besides', 'equally', 'further', 'furthermore', 'in', 'addition', 'moreover', 'too',
                        'after', 'before', 'when', 'while', 'as', 'by', 'the', 'that', 'since', 'until', 'soon', 'once', 'so', 'whenever', 'every', 'first', 'last',
                        'because', 'even', 'though', 'although', 'whereas', 'if', 'unless', 'only', 'whether', 'or', 'not', 'next',
                        'likewise', 'however', 'contrary', 'other', 'hand', 'contrast', 'nevertheless', 'brief', 'summary', 'short',
                        'for', 'example', 'for instance', 'fact', 'finally', 'in brief', 'in conclusion', 'in other words', 'in short', 'in summary', 'therefore',
                        'accordingly', 'as a result', 'consequently', 'for this reason', 'afterward', 'in the meantime', 'later', 'meanwhile', 'second', 'earlier', 'still', 'then', 'third');
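The counting step described above can be sketched roughly as follows. This is only a minimal illustration, not the actual cron job: the function name `topWords`, the `$feeds` parameter, and the tokenization rule (lowercase, split on anything that is not a letter, digit, or apostrophe) are all assumptions.

```php
<?php
// Hypothetical helper (not from the original post): counts words across
// the given feed texts, skips the excluded words, and returns the $limit
// most frequent words as a word => count map.
function topWords(array $feeds, array $stopWords, int $limit = 10): array
{
    // Flip the list so each exclusion check is a cheap isset() lookup.
    $stop = array_flip($stopWords);
    $counts = [];

    foreach ($feeds as $text) {
        // Assumed tokenization: lowercase, then split on any run of
        // characters that are not a-z, 0-9, or an apostrophe.
        $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (isset($stop[$word])) {
                continue; // excluded word
            }
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    arsort($counts); // highest count first, keys (the words) preserved
    return array_slice($counts, 0, $limit, true);
}
```

With single-word tokenization like this, multi-word entries in the exclusion list such as 'for instance' or 'as a result' never match anything, so they would need to be handled separately or split into their individual words.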

But this does not completely solve the problem of eliminating all the unneeded words and keeping only the useful ones.

Can someone please guide me on this and tell me how I can improve my algorithm?

Zeeshan Rang asked Dec 28 '09 21:12


1 Answer

If you want the statistically significant outliers, you may want to calculate a z-score for each word in a recent subset relative to the overall text.

So if

t is number of occurrences of word in subset
o is number of occurrences of word overall
n_t is number of words in subset
n_o is number of words overall

then calculate:

p_hat = t / n_t
p_0 = o / n_o

z = (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n_t)

The higher the z, the more statistically significant the mention of the word in the subset is relative to the overall text. This can also be used to find words that are oddly rare in the subset relative to the overall text: those have a strongly negative z.
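The calculation above could be sketched in PHP along these lines. The function names `zScore` and `wordZScores` are hypothetical, and the sketch assumes the recent subset is part of the overall text, so every word in the subset also has an overall count.

```php
<?php
// Hypothetical helper: z-score of one word, using the symbols above
// (t, o, n_t, n_o). Returns null when the score is undefined.
function zScore(int $t, int $o, int $nT, int $nO): ?float
{
    if ($nT <= 0 || $nO <= 0) {
        return null; // empty subset or empty overall text
    }
    $pHat = $t / $nT; // p_hat = t / n_t
    $p0 = $o / $nO;   // p_0 = o / n_o
    if ($p0 <= 0 || $p0 >= 1) {
        return null; // denominator would be zero
    }
    return ($pHat - $p0) / sqrt($p0 * (1 - $p0) / $nT);
}

// Hypothetical helper: scores every word given word => count maps for the
// recent subset and for the overall text, highest z first.
// Assumption: words in $subsetCounts also appear in $overallCounts.
function wordZScores(array $subsetCounts, array $overallCounts): array
{
    $nT = array_sum($subsetCounts);  // n_t
    $nO = array_sum($overallCounts); // n_o
    $scores = [];

    foreach ($overallCounts as $word => $o) {
        $t = $subsetCounts[$word] ?? 0; // absent from subset means t = 0
        $z = zScore($t, $o, $nT, $nO);
        if ($z !== null) {
            $scores[$word] = $z;
        }
    }

    arsort($scores); // most over-represented words first
    return $scores;
}
```

For example, a word that occurs 30 times in a 1,000-word subset but only 200 times in 100,000 words overall gets p_hat = 0.03 and p_0 = 0.002, so `zScore(30, 200, 1000, 100000)` comes out at roughly 19.8. Words with very small counts can produce noisy scores, so a minimum-count threshold before ranking may be worth considering.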

James Tauber answered Nov 01 '22 12:11