Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding algorithms for measuring trends

Tags:

algorithm

What's the rationale behind the formula used in the hive_trend_mapper.py program of this Hadoop tutorial on calculating Wikipedia trends?

There are actually two components: a monthly trend and a daily trend. I'm going to focus on the daily trend, but similar questions apply to the monthly one.

In the daily trend, pageviews is an array of number of page views per day for this topic, one element per day, and total_pageviews is the sum of this array:

# pageviews for most recent day
y2 = pageviews[-1]
# pageviews for previous day
y1 = pageviews[-2]
# Simple baseline trend algorithm
slope = y2 - y1
trend = slope  * log(1.0 +int(total_pageviews))
error = 1.0/sqrt(int(total_pageviews))
return trend, error

I know what it's doing superficially: it just looks at the change over the past day (slope), and scales this up to the log of 1+total_pageviews (log(1)==0, so this scaling factor is non-negative). It can be seen as treating the month's total pageviews as a weight, but tempered as it grows - this way, the total pageviews stop making a difference for things that are "popular enough," but at the same time big changes on insignificant don't get weighed as much.

But why do this? Why do we want to discount things that were initially unpopular? Shouldn't big deltas matter more for items that have a low constant popularity, and less for items that are already popular (for which the big deltas might fall well within a fraction of a standard deviation)? As a strawman, why not simply take y2-y1 and be done with it?

And what would the error be useful for? The tutorial doesn't really use it meaningfully again. Then again, it doesn't tell us how trend is used either - this is what's plotted in the end product, correct?

Where can I read up for a (preferably introductory) background on the theory here? Is there a name for this madness? Is this a textbook formula somewhere?

Thanks in advance for any answers (or discussion!).

like image 415
Yang Avatar asked Oct 28 '09 07:10

Yang


People also ask

How to determine which algorithm is better?

To determine which algorithm is faster for a given dataset size, you need to tune each one's performance until it is "as fast as possible" and then see which one wins. Performance tuning requires profiling, or single-stepping at the instruction level, or my favorite technique, stackshots.


1 Answers

As the in-line comment goes, this is a simple "baseline trend algorithm", which basically means before you compare the trends of two different pages, you have to establish a baseline. In many cases, the mean value is used, it's straightforward if you plot the pageviews against the time axis. This method is widely used in monitoring water quality, air pollutants, etc. to detect any significant changes w.r.t the baseline.

In OP's case, the slope of pageviews is weighted by the log of totalpageviews. This sorta uses the totalpageviews as a baseline correction for the slope. As Simon put it, this puts a balance between two pages with very different totalpageviews. For exmaple, A has a slope 500 over 1000,000 total pageviews, B is 1000 over 1,000. A log basically means 1000,000 is ONLY twice more important than 1,000 (rather than 1000 times). If you only consider the slope, A is less popular than B. But with a weight, now the measure of popularity of A is the same as B. I think it is quite intuitive: though A's pageviews is only 500 pageviews, but that's because it's saturating, you still gotta give it enough credit.

As for the error, I believe it comes from the (relative) standard error, which has a factor 1/sqrt(n), where n is the number of data points. In the code, the error is equal to (1/sqrt(n))*(1/sqrt(mean)). It roughly translates into : the more data points, the more accurate the trend. I don't see it is an exact math formula, just a brute trend analysis algorithm, anyway the relative value is more important in this context.

In summary, I believe it's just an empirical formula. More advanced topics can be found in some biostatistics textbooks (very similar to monitoring the breakout of a flu or the like.)

like image 172
Dingle Avatar answered Nov 16 '22 04:11

Dingle