How can I measure trends in certain words, like Twitter?

I have a corpus of newspaper articles organized by day. Each word in the corpus has a frequency count for each day. I have been toying with finding an algorithm that captures the breakaway words, similar to the way Twitter measures Trends in people's tweets.

For instance, say the word 'recession' appears with the following frequency in the same group of newspapers:
Day 1 | recession | 456
Day 2 | recession | 2134
Day 3 | recession | 3678

While 'europe' appears with these frequencies:
Day 1 | europe | 67895
Day 2 | europe | 71999
Day 3 | europe | 73321

I was thinking of taking the % growth per day and multiplying it by the log of the sum of frequencies. Then I would take the average to score and compare various words.

In this case:
recession = (3.68*8.74+0.72*8.74)/2 = 19.23
europe = (0.06*12.27+0.02*12.27)/2 = 0.49
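For reference, the scoring idea above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the question; the function name `growth_score` is made up here, and the natural log is assumed because it matches the 8.74 and 12.27 figures.

```python
import math

def growth_score(counts):
    """Average day-over-day fractional growth, scaled by the log of total frequency."""
    # Fractional growth between consecutive days, e.g. 2134/456 - 1 = 3.68
    growths = [(b - a) / a for a, b in zip(counts, counts[1:])]
    # Natural log of the summed frequencies (assumed base, see lead-in)
    weight = math.log(sum(counts))
    return sum(g * weight for g in growths) / len(growths)

recession = growth_score([456, 2134, 3678])
europe = growth_score([67895, 71999, 73321])
print(round(recession, 2), round(europe, 2))
```

Because the code does not round the intermediate values, the results differ slightly from the hand-computed 19.23 and 0.49, but the ordering of the two words is the same.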

Is there a better way to capture the explosive growth? I'm trying to mine the daily corpus to find terms that are mentioned more and more over a specific time period. Please let me know if there is a better algorithm. I want to be able to find words with high non-constant acceleration. Maybe taking the second derivative would be more effective. Or maybe I'm making this way too complex and watched too much physics programming on the Discovery Channel. Let me know with a math example if possible. Thanks!

732

asked Dec 13 '11 00:12

datayoda

1 Answer

First thing to notice is that this can be approximated by a local problem. That is to say, a "trending" word really depends only upon recent data. So immediately we can truncate our data to the most recent N days where N is some experimentally determined optimal value. This significantly cuts down on the amount of data we have to look at.

In fact, the NPR article suggests this.

Then you need to somehow look at growth. And this is precisely what the derivative captures. First thing to do is normalize the data. Divide all your data points by the value of the first data point. This makes it so that the large growth of an infrequent word isn't drowned out by the relatively small growth of a popular word.
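A minimal sketch of the truncation and normalization steps, assuming each word's counts are stored as a chronological list. The name `recent_normalized` and the parameter `n_days` are hypothetical, and how to treat a word whose first count is zero is a choice the answer leaves open; here it simply raises an error.

```python
def recent_normalized(counts, n_days):
    """Keep the last n_days counts and divide each by the first kept value."""
    if n_days < 1:
        raise ValueError("n_days must be at least 1")
    window = counts[-n_days:]  # most recent N days only
    base = window[0]
    if base == 0:
        # Assumption: a zero baseline cannot be normalized, so reject it
        raise ValueError("first value in the window is zero; cannot normalize")
    return [c / base for c in window]

recession = recent_normalized([456, 2134, 3678], 3)
europe = recent_normalized([67895, 71999, 73321], 3)
```

After normalization both series start at 1.0, so the jump of 'recession' to more than 8 times its starting value stands out against 'europe', which stays below 1.1.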

For the first derivative, with the data in chronological order (data[i] is the count on day i), do something like this:

d[i] = (data[i+k] - data[i])/k

for some experimentally determined value of k (which, in this case, is a number of days). Written this way, a growing word gets a positive value. Similarly, the second derivative can be expressed as:

d2[i] = (data[i] - 2*data[i+k] + data[i+2k])/k^2
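As a rough sketch, the two difference formulas could look like this in Python. It assumes the data is a chronological list (oldest day first), so growth over k days is `data[i + k] - data[i]`, and it uses the standard finite-difference scaling of k squared for the second difference. The function names are illustrative only.

```python
def first_difference(data, k):
    """Approximate first derivative: change over k days, per day."""
    return [(data[i + k] - data[i]) / k for i in range(len(data) - k)]

def second_difference(data, k):
    """Approximate second derivative using points i, i+k and i+2k."""
    return [
        (data[i] - 2 * data[i + k] + data[i + 2 * k]) / (k * k)
        for i in range(len(data) - 2 * k)
    ]

# Sanity check on a series with known derivatives: squares 1, 4, 9, 16, 25
squares = [1, 4, 9, 16, 25]
velocity = first_difference(squares, 1)       # [3.0, 5.0, 7.0, 9.0]
acceleration = second_difference(squares, 1)  # [2.0, 2.0, 2.0]
```

With k = 2 the second difference of the same series is still 2.0, which is why the step appears squared in the denominator.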

Higher derivatives can also be expressed like this. Then you need to assign some kind of weighting system for these derivatives. This is a purely experimental procedure which really depends on what you want to consider "trending." For example, you might want to give acceleration of growth half as much weight as the velocity. Another thing to note is that you should try your best to remove noise from your data because derivatives are very sensitive to noise. You do this by carefully choosing your value for k as well as discarding words with very low frequencies altogether.
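One way to combine these pieces, sketched under assumptions: the weights (1.0 for velocity, 0.5 for acceleration, following the "half as much weight" example), the `min_total` cutoff of 100 and the function name `trend_score` are all placeholders to be tuned experimentally, not recommended values. The differences are taken backward from the most recent day.

```python
def trend_score(counts, k=1, w_velocity=1.0, w_accel=0.5, min_total=100):
    """Weighted sum of velocity and acceleration at the most recent day.

    Returns None for words that are too rare or have too little history.
    """
    if sum(counts) < min_total or len(counts) < 2 * k + 1 or counts[0] == 0:
        return None  # discard noisy low-frequency words and short series
    data = [c / counts[0] for c in counts]  # normalize by the first value
    velocity = (data[-1] - data[-1 - k]) / k
    accel = (data[-1] - 2 * data[-1 - k] + data[-1 - 2 * k]) / (k * k)
    return w_velocity * velocity + w_accel * accel

recession = trend_score([456, 2134, 3678])
europe = trend_score([67895, 71999, 73321])
rare = trend_score([1, 2, 3])  # below min_total, so it is filtered out
```

On the sample data 'recession' scores far above 'europe', while the rare word is dropped before any derivative is computed.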

I also notice that you multiply by the log of the sum of the frequencies. I presume this is to give the growth of popular words more weight (because more popular words are less likely to trend in the first place). The standard way of measuring how popular a word is is to look at its inverse document frequency (IDF).

I would divide by the IDF of a word to give the growth of more popular words more weight.

IDF[word] = log(D/df[word])

where D is the total number of documents (e.g. for Twitter it would be the total number of tweets) and df[word] is the number of documents containing word (e.g. the number of tweets containing a word).

A high IDF corresponds to an unpopular word whereas a low IDF corresponds to a popular word.
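A small sketch of the IDF adjustment, assuming you know the total document count and each word's document frequency. The function names and the example counts (10,000 documents, a word in 10 of them versus a word in 5,000) are invented for illustration. Note that a word appearing in every document has an IDF of zero, so the division needs a guard.

```python
import math

def idf(total_docs, doc_freq):
    """Inverse document frequency: log(D / df[word])."""
    return math.log(total_docs / doc_freq)

def idf_adjusted(growth, total_docs, doc_freq):
    """Divide a growth score by IDF so popular words get more weight."""
    value = idf(total_docs, doc_freq)
    if value <= 0:
        # A word present in every document has IDF 0; handling is a design choice
        raise ValueError("IDF is zero; word appears in every document")
    return growth / value

rare_idf = idf(10000, 10)       # unpopular word, high IDF
common_idf = idf(10000, 5000)   # popular word, low IDF
rare_adj = idf_adjusted(1.0, 10000, 10)
common_adj = idf_adjusted(1.0, 10000, 5000)
```

For the same raw growth, the popular word ends up with the larger adjusted score, which is the effect described above.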

123

answered Oct 21 '22 15:10

tskuzzy

Tags: algorithm, math, geometry, statistics, linear-algebra