Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions. I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should weight the topics which are always hot less. The topics which normally (almost) no one mentions should be the hottest ones. Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment. Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem". My question: How can you code an algorithm or use an existing one to solve this problem? Having a list with the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones. I know, in the article above, there is some kind of algorithm mentioned. I've tried to code it in PHP but I don't think that it'll work. It just finds the majority, doesn't it? I hope you can help me (coding examples would be great).

This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average. In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day. <pre class="prettyprint"><code>z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends] </code></pre> When a z-score is used, the higher or lower the z-score the more abnormal the trend, so for example if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends the highest 10 z-scores will relate to the most abnormally increasing z-scores. Please see Wikipedia for more information, about z-scores. Code <pre class="prettyprint"><code>from math import sqrt def zscore(obs, pop): # Size of population. number = float(len(pop)) # Average population value. avg = sum(pop) / number # Standard deviation of population. std = sqrt(sum(((c - avg) ** 2) for c in pop) / number) # Zscore Calculation. return (obs - avg) / std </code></pre> Sample Output <pre class="prettyprint"><code>>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9]) 3.5 >>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20]) 0.0739221270955 >>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]) 1.00303599234 >>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]) -0.922793112954 >>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]) 1.65291949506 </code></pre> Notes <ul> <li> You can use this method with a sliding window (i.e. last 30 days) if you wish not to take to much history into account, which will make short term trends more pronounced and can cut down on the processing time. </li> <li> You could also use a z-score for values such as change in views from one day to next day to locate the abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views per day graph. </li> <li> If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them and hence you only need to keep these values for the history, not each data value. The following code demonstrates this. <pre class="prettyprint"><code> from math import sqrt class zscore: def __init__(self, pop = []): self.number = float(len(pop)) self.total = sum(pop) self.sqrTotal = sum(x ** 2 for x in pop) def update(self, value): self.number += 1.0 self.total += value self.sqrTotal += value ** 2 def avg(self): return self.total / self.number def std(self): return sqrt((self.sqrTotal / self.number) - self.avg() ** 2) def score(self, obs): return (obs - self.avg()) / self.std() </code></pre> </li> <li> Using this method your work flow would be as follows. For each topic, tag, or page create a floating point field, for the total number of days, sum of views, and sum of views squared in your database. If you have historic data, initialize these fields using that data, otherwise initialize to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages, with the highest X z-scores are your X "hotest trends" of the day. Finally update each of the 3 fields with the day's value and repeat the process next day. </li> </ul> New Addition Normal z-scores as discussed above do not take into account the order of the data and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most current data should have more weight than older data and hence we want the '1' observation to have a larger magnitude score than the '9' observation. In order to achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See code for details: Code <pre class="prettyprint"><code>class fazscore: def __init__(self, decay, pop = []): self.sqrAvg = self.avg = 0 # The rate at which the historic data's effect will diminish. self.decay = decay for x in pop: self.update(x) def update(self, value): # Set initial averages to the first value in the sequence. if self.avg == 0 and self.sqrAvg == 0: self.avg = float(value) self.sqrAvg = float((value ** 2)) # Calculate the average of the rest of the values using a # floating average. else: self.avg = self.avg * self.decay + value * (1 - self.decay) self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay) return self def std(self): # Somewhat ad-hoc standard deviation calculation. return sqrt(self.sqrAvg - self.avg ** 2) def score(self, obs): if self.std() == 0: return (obs - self.avg) * float("infinity") else: return (obs - self.avg) / self.std() </code></pre> Sample IO <pre class="prettyprint"><code>>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1) -1.67770595327 >>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9) 0.596052006642 >>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12) 3.46442230724 >>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22) 7.7773245459 >>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20) -0.24633160155 >>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20) 1.1069362749 >>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2) -0.786764452966 >>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9) 1.82262469243 >>> fazscore(0.8, [40] * 200).score(1) -inf </code></pre> Update As David Kemp correctly pointed out, if given a series of constant values and then a zscore for an observed value which differs from the other values is requested the result should probably be non-zero. In fact the value returned should be infinity. So I changed this line, <pre class="prettyprint"><code>if self.std() == 0: return 0 </code></pre> to: <pre class="prettyprint"><code>if self.std() == 0: return (obs - self.avg) * float("infinity") </code></pre> This change is reflected in the fazscore solution code. If one does not want to deal with infinite values an acceptable solution could be to instead change the line to: <pre class="prettyprint"><code>if self.std() == 0: return obs - self.avg </code></pre>

What is the best way to compute trending topics or tags?

Tags:

algorithm

information-retrieval

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions.

I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should weight the topics which are always hot less. The topics which normally (almost) no one mentions should be the hottest ones.

Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment.

Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".

My question: How can you code an algorithm or use an existing one to solve this problem? Having a list with the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.

I know, in the article above, there is some kind of algorithm mentioned. I've tried to code it in PHP but I don't think that it'll work. It just finds the majority, doesn't it?

I hope you can help me (coding examples would be great).

887

asked Apr 24 '09 20:04

caw

2 Answers

This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average.

In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day.

z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]

When a z-score is used, the higher or lower the z-score the more abnormal the trend, so for example if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends the highest 10 z-scores will relate to the most abnormally increasing z-scores.

Please see Wikipedia for more information, about z-scores.

Code

from math import sqrt  def zscore(obs, pop):     # Size of population.     number = float(len(pop))     # Average population value.     avg = sum(pop) / number     # Standard deviation of population.     std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)     # Zscore Calculation.     return (obs - avg) / std

Sample Output

>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9]) 3.5 >>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20]) 0.0739221270955 >>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]) 1.00303599234 >>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]) -0.922793112954 >>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]) 1.65291949506

Notes

You can use this method with a sliding window (i.e. last 30 days) if you wish not to take to much history into account, which will make short term trends more pronounced and can cut down on the processing time.
You could also use a z-score for values such as change in views from one day to next day to locate the abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views per day graph.

If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them and hence you only need to keep these values for the history, not each data value. The following code demonstrates this.

  from math import sqrt    class zscore:       def __init__(self, pop = []):           self.number = float(len(pop))           self.total = sum(pop)           self.sqrTotal = sum(x ** 2 for x in pop)       def update(self, value):           self.number += 1.0           self.total += value           self.sqrTotal += value ** 2       def avg(self):           return self.total / self.number       def std(self):           return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)       def score(self, obs):           return (obs - self.avg()) / self.std()

Using this method your work flow would be as follows. For each topic, tag, or page create a floating point field, for the total number of days, sum of views, and sum of views squared in your database. If you have historic data, initialize these fields using that data, otherwise initialize to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages, with the highest X z-scores are your X "hotest trends" of the day. Finally update each of the 3 fields with the day's value and repeat the process next day.

New Addition

Normal z-scores as discussed above do not take into account the order of the data and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most current data should have more weight than older data and hence we want the '1' observation to have a larger magnitude score than the '9' observation. In order to achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See code for details:

Code

class fazscore:     def __init__(self, decay, pop = []):         self.sqrAvg = self.avg = 0         # The rate at which the historic data's effect will diminish.         self.decay = decay         for x in pop: self.update(x)     def update(self, value):         # Set initial averages to the first value in the sequence.         if self.avg == 0 and self.sqrAvg == 0:             self.avg = float(value)             self.sqrAvg = float((value ** 2))         # Calculate the average of the rest of the values using a          # floating average.         else:             self.avg = self.avg * self.decay + value * (1 - self.decay)             self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)         return self     def std(self):         # Somewhat ad-hoc standard deviation calculation.         return sqrt(self.sqrAvg - self.avg ** 2)     def score(self, obs):         if self.std() == 0: return (obs - self.avg) * float("infinity")         else: return (obs - self.avg) / self.std()

Sample IO

>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1) -1.67770595327 >>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9) 0.596052006642 >>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12) 3.46442230724 >>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22) 7.7773245459 >>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20) -0.24633160155 >>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20) 1.1069362749 >>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2) -0.786764452966 >>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9) 1.82262469243 >>> fazscore(0.8, [40] * 200).score(1) -inf

Update

As David Kemp correctly pointed out, if given a series of constant values and then a zscore for an observed value which differs from the other values is requested the result should probably be non-zero. In fact the value returned should be infinity. So I changed this line,

if self.std() == 0: return 0

to:

if self.std() == 0: return (obs - self.avg) * float("infinity")

This change is reflected in the fazscore solution code. If one does not want to deal with infinite values an acceptable solution could be to instead change the line to:

if self.std() == 0: return obs - self.avg

192

answered Sep 20 '22 22:09

21 revs, 2 users 99%

You need an algorithm that measures the velocity of a topic - or in other words, if you graph it you want to show those that are going up at an incredible rate.

This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor of your overall calculation.

Normalize

One technique you'll need to do is to normalize all your data. For each topic you are following, keep a very low pass filter that defines that topic's baseline. Now every data point that comes in about that topic should be normalized - subtract its baseline and you'll get ALL of your topics near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which will bring the signal to around 1.0 - this not only brings all signals in line with each other (normalizes the baseline), but also normalizes the spikes. A britney spike is going to be magnitudes larger than someone else's spike, but that doesn't mean you should pay attention to it - the spike may be very small relative to her baseline.

Derive

Once you've normalized everything, figure out the slope of each topic. Take two consecutive points, and measure the difference. A positive difference is trending up, a negative difference is trending down. Then you can compare the normalized differences, and find out what topics are shooting upward in popularity compared to other topics - with each topic scaled appropriate to it's own 'normal' which may be magnitudes of order different from other topics.

This is really a first-pass at the problem. There are more advanced techniques which you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs) but it should be enough to get you started.

Regarding the article

The article is about topic trending, but it's not about how to calculate what's hot and what's not, it's about how to process the huge amount of information that such an algorithm must process at places like Lycos and Google. The space and time required to give each topic a counter, and find each topic's counter when a search on it goes through is huge. This article is about the challenges one faces when attempting such a task. It does mention the Brittney effect, but it doesn't talk about how to overcome it.

As Nixuz points out this is also referred to as a Z or Standard Score.

answered Sep 22 '22 22:09

Adam Davis

Related questions
                            
                                Difference between Divide and Conquer Algo and Dynamic Programming
                            
                                How to find the lowest common ancestor of two nodes in any binary tree?
                            
                                Non-recursive depth first search algorithm [closed]
                            
                                Explain how finding cycle start node in cycle linked list work?
                            
                                Are there any worse sorting algorithms than Bogosort (a.k.a Monkey Sort)? [closed]
                            
                                Rounding up to the nearest multiple of a number
                            
                                Algorithm to find Largest prime factor of a number
                            
                                JavaScript - get the first day of the week from current date
                            
                                Fastest way to flatten / un-flatten nested JSON objects
                            
                                Breadth First Vs Depth First
                            
                                How does the HyperLogLog algorithm work?
                            
                                Heap vs Binary Search Tree (BST)
                            
                                What are the mathematical/computational principles behind this game?
                            
                                Picking a random element from a set
                            
                                What exactly does big Ө notation represent?
                            
                                Given an array of numbers, return array of products of all other numbers (no division)
                            
                                Throwing the fattest people off of an overloaded airplane.
                            
                                Efficient way to apply multiple filters to pandas DataFrame or Series
                            
                                How to compare two colors for similarity/difference
                            
                                In-Place Radix Sort

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With