
How can I measure trends in certain words, like Twitter?

I have a corpus of newspaper articles, organized by day. Each word in the corpus has a frequency count for each day. I have been toying with finding an algorithm that captures the break-away words, similar to the way Twitter measures Trends in people's tweets.

For instance, say the word 'recession' appears with the following frequency in the same group of newspapers:
Day 1 | recession | 456
Day 2 | recession | 2134
Day 3 | recession | 3678

while 'europe' appears with these frequencies:
Day 1 | europe | 67895
Day 2 | europe | 71999
Day 3 | europe | 73321

I was thinking of taking the % growth per day and multiplying it by the log of the sum of frequencies. Then I would take the average to score and compare various words.

In this case:
recession = (3.68*8.74+0.72*8.74)/2 = 19.23
europe = (0.06*12.27+0.02*12.27)/2 = 0.49
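The arithmetic above can be reproduced with a short Python sketch. The function name `growth_score` is made up for illustration, and the natural log is assumed because that is what matches the 8.74 and 12.27 used in the question.

```python
import math

def growth_score(counts):
    """Mean day-over-day % growth, scaled by the log of the total count."""
    # Fractional growth from each day to the next, e.g. (2134 - 456) / 456.
    growth = [(b - a) / a for a, b in zip(counts, counts[1:])]
    # Natural log of the summed frequencies (assumed base, see lead-in).
    weight = math.log(sum(counts))
    return sum(g * weight for g in growth) / len(growth)

recession_score = growth_score([456, 2134, 3678])
europe_score = growth_score([67895, 71999, 73321])
print(round(recession_score, 2), round(europe_score, 2))
```

The results should land within a few hundredths of the hand-computed 19.23 and 0.49, since the figures above use rounded intermediate values.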

Is there a better way to capture the explosive growth? I'm trying to mine the daily corpus to find terms that are mentioned more and more over a specific time period. Please let me know if there is a better algorithm. I want to be able to find words with high non-constant acceleration. Maybe taking the second derivative would be more effective. Or maybe I'm making this way too complex and have watched too much physics programming on the Discovery Channel. Let me know with a math example if possible. Thanks!

datayoda asked Dec 13 '11 00:12





1 Answer

The first thing to notice is that this can be approximated by a local problem. That is to say, whether a word is "trending" really depends only on recent data. So we can immediately truncate our data to the most recent N days, where N is some experimentally determined optimal value. This significantly cuts down on the amount of data we have to look at.

In fact, the NPR article suggests this.

Then you need to somehow look at growth, and this is precisely what the derivative captures. The first thing to do is normalize the data: divide all of a word's data points by the value of its first data point. This makes it so that the large growth of an infrequent word isn't drowned out by the relatively small growth of a popular word.
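As a small illustration of the normalization step, here is a Python sketch using the counts from the question. The helper name `normalize` is hypothetical, and the zero check is an added assumption about how to treat a word that is absent on the first day.

```python
def normalize(counts):
    """Divide every count by the first one so each series starts at 1.0."""
    first = counts[0]
    if first == 0:
        # Assumption: a word with no mentions on day 1 cannot be scaled this way.
        raise ValueError("first count must be non-zero to normalize")
    return [c / first for c in counts]

recession = normalize([456, 2134, 3678])
europe = normalize([67895, 71999, 73321])
print(recession)
print(europe)
```

After this step 'recession' climbs to roughly 8 times its starting value, while 'europe' stays close to 1, even though its raw counts are far larger.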

For the first derivative, do something like this:

d[i] = (data[i] - data[i+k])/k

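for some experimentally determined value of k (which, in this case, is a number of days). For this difference to be positive when a word is growing, data[i+k] must be the older value, i.e. the index counts back from the most recent day. Similarly, the second derivative can be expressed as: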

d2[i] = (data[i] - 2*data[i+k] + data[i+2*k])/(k*k)

Higher derivatives can be expressed in the same way. Then you need to assign some kind of weighting to these derivatives. This is a purely experimental procedure which really depends on what you want to consider "trending." For example, you might want to give the acceleration of growth half as much weight as the velocity.

Also try your best to remove noise from your data, because derivatives are very sensitive to noise. You can do this by carefully choosing your value for k and by discarding words with very low frequencies altogether.
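Putting the pieces together, a minimal Python sketch might look like the following. Everything named here (`trend_score`, `accel_weight`, `min_count`) is hypothetical, the half weight on acceleration is just the example weighting mentioned above, and the threshold of 50 is an arbitrary placeholder. The list is assumed to be in chronological order (oldest first), so each difference is taken as later minus earlier, and the second difference is divided by k squared as in the usual finite-difference form.

```python
def trend_score(counts, k=1, accel_weight=0.5, min_count=50):
    """Score a word from its most recent velocity and acceleration.

    counts is a chronological list of daily frequencies (oldest first).
    Returns None when the series is too short or too sparse to trust.
    """
    # Noise filter: skip short series and words with very low frequencies.
    if len(counts) < 2 * k + 1 or min(counts) < min_count:
        return None
    # Normalize by the first data point so rare and common words compare.
    first = counts[0]
    data = [c / first for c in counts]
    # First difference over k days, ending at the most recent day.
    velocity = (data[-1] - data[-1 - k]) / k
    # Second difference over the last 2k days, scaled by k squared.
    acceleration = (data[-1] - 2 * data[-1 - k] + data[-1 - 2 * k]) / (k * k)
    return velocity + accel_weight * acceleration

recession = trend_score([456, 2134, 3678])
europe = trend_score([67895, 71999, 73321])
too_rare = trend_score([1, 3, 9])
print(recession, europe, too_rare)
```

With the numbers from the question, 'recession' scores far higher than 'europe'. Its acceleration term is slightly negative, though, because the jump from day 2 to day 3 is smaller than the jump from day 1 to day 2 once the counts are normalized. Whether that should count against a word is exactly the kind of weighting decision that has to be tuned experimentally.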

I also notice that you multiply by the log of the sum of the frequencies. I presume this is to give the growth of popular words more weight (because more popular words are less likely to trend in the first place). The standard way of measuring how popular a word is, is to look at its inverse document frequency (IDF).

I would divide by the IDF of a word to give the growth of more popular words more weight.

IDF[word] = log(D/df[word])

where D is the total number of documents (e.g. for Twitter it would be the total number of tweets) and df[word] is the number of documents containing word (e.g. the number of tweets containing a word).

A high IDF corresponds to an unpopular word whereas a low IDF corresponds to a popular word.
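To make the IDF weighting concrete, here is a small Python sketch. The function names and the document counts are invented for illustration, and the natural log is assumed since the formula above does not fix a base.

```python
import math

def idf(total_docs, doc_freq):
    """Inverse document frequency: log(D / df[word])."""
    return math.log(total_docs / doc_freq)

def idf_weighted(growth, total_docs, doc_freq):
    """Divide a growth value by the word's IDF so popular words count more."""
    value = idf(total_docs, doc_freq)
    if value == 0:
        # A word found in every document has IDF 0 and cannot be divided by.
        raise ValueError("word appears in every document; IDF is zero")
    return growth / value

# Hypothetical corpus of 100,000 documents.
rare = idf(100_000, 100)       # word found in 100 documents
common = idf(100_000, 50_000)  # word found in half of all documents
print(rare, common)
print(idf_weighted(1.0, 100_000, 100), idf_weighted(1.0, 100_000, 50_000))
```

For the same raw growth, the common word ends up with the larger weighted value, which is the intended effect of dividing by the IDF. A word that appears in every document would need special handling, since its IDF is zero.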

tskuzzy answered Oct 21 '22 15:10
