How to notice unusual news activity

Suppose you were able to keep track of the news mentions of different entities, such as "Steve Jobs" and "Steve Ballmer".

How could you tell whether the number of mentions of an entity in a given time period was unusual relative to its normal frequency of appearance?

I imagine that for a more popular person like Steve Jobs an increase of around 50% might be unusual (from 1000 to 1500), while for a relatively unknown CEO a hundredfold increase on a given day could be possible (from 2 to 200). If you didn't have a way of scaling for that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.

Update: To make it clearer, assume that you are already able to get a continuous news stream, identify the entities in each news item, and store all of this in a relational data store.

ʞɔıu asked Jun 11 '10 15:06



1 Answer

You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points for an entity, you can see whether a new value is a substantial change that falls outside its usual variance.
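As a rough illustration of the rolling-average idea, here is a minimal Python sketch. It assumes you already have one mention count per time period for an entity; the function name `rolling_zscore`, the `window` parameter, and the sample counts are made up for the example. It scores each new count by how many standard deviations it sits from the mean of the previous `window` counts.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore(counts, window=7):
    """Yield (count, score) pairs for a sequence of per-period counts.

    score is the number of standard deviations the count lies from the
    mean of the previous `window` counts. It is None until the window is
    full, or when the history has no variance at all.
    """
    history = deque(maxlen=window)  # keeps only the last `window` counts
    for count in counts:
        score = None
        if len(history) == window:
            values = list(history)
            mu = mean(values)
            sigma = stdev(values)  # sample standard deviation
            if sigma > 0:
                score = (count - mu) / sigma
        yield count, score
        history.append(count)  # the new count joins the history afterwards


# Hypothetical daily mention counts for one entity, ending in a spike.
daily_mentions = [1000, 1040, 980, 1010, 990, 1020, 1005, 1500]
for count, score in rolling_zscore(daily_mentions, window=7):
    print(count, "n/a" if score is None else round(score, 1))
```

Because each entity is compared against its own history, a busy entity and an obscure one are put on the same scale. What score counts as "unusual" (3, for example) is a threshold you would have to tune.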

You could also try some normalization. One very simple approach would be that each entity has a total number of mentions (m), a fractional change from the previous time period (δ), and then some normalized value (z), where z = m * δ. Let's look at the table below (m0 is the previous value of m, so δ = (m - m0) / m0):

Name               m     m0     δ      z
Steve Jobs      4950   4500   .10    495
Steve Ballmer    400    300   .33    132
Larry Ellison     50     10   4.0    200
Andy Nobody       50     40   .25   12.5

Here, a 400% change for unknown Larry Ellison results in a z value of 200, a 10% change for the much better known Steve Jobs results in 495, and my spike of 25% is still a low 12.5. You could tweak this formula depending on what you feel are good weights, or use the standard deviation or the rolling average to find out whether a value is far away from the entity's "expected" result.
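The z = m * δ weighting can be sketched in a few lines of Python. This assumes δ is computed as (m - m0) / m0 from the current and previous counts; the function name `weighted_change` and the choice to return None when there is no previous count are assumptions made for the example.

```python
def weighted_change(m, m0):
    """Return (delta, z) for current count m and previous count m0.

    delta is the fractional change from the previous period and
    z = m * delta weights that change by the current volume.
    Returns (None, None) when m0 is 0, since delta is undefined there
    (how to treat brand-new entities is left open in this sketch).
    """
    if m0 == 0:
        return None, None
    delta = (m - m0) / m0
    return delta, m * delta


# (name, m, m0) taken from the m and m0 columns of the table above.
rows = [
    ("Steve Jobs", 4950, 4500),
    ("Steve Ballmer", 400, 300),
    ("Larry Ellison", 50, 10),
    ("Andy Nobody", 50, 40),
]

# Rank entities from most to least unusual by z.
ranked = sorted(rows, key=lambda r: weighted_change(r[1], r[2])[1], reverse=True)
for name, m, m0 in ranked:
    delta, z = weighted_change(m, m0)
    print(f"{name:<15} delta={delta:.2f} z={z:.1f}")
```

Multiplying by m keeps a large percentage jump on a tiny count from outranking a modest jump on a large count. Other weights (for example, the square root or logarithm of m) would shift that balance and are equally easy to try.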

Andy answered Sep 22 '22 18:09
