
When calculating trends, how do you account for low sample size?

I'm processing statistics on home approvals per month. I'd like to be able to show trends, that is, which areas have seen a large relative increase or decrease since the previous month(s).

My first naive approach was to just calculate the percentage change between two months, but that has problems when the counts are very low: any change at all is magnified.

// diff = (new - old) / old
     Area      |  June  |  July  |  Diff  |
 --------------|--------|--------|--------|
 South Sydney  |   427  |   530  |  +24%  |
 North Sydney  |   167  |   143  |  -14%  |
 Dubbo         |     1  |     3  | +200%  |

I don't want to just ignore any area or value as an outlier, but I don't want Dubbo's increase of 2 per month to outshine the increase of 103 in South Sydney. Is there a better equation I could use to show more useful trend information?

This data is eventually being plotted on Google Maps. In this first attempt, I'm just converting the difference to a "heatmap colour" (blue = decrease, green = no change, red = increase). Perhaps using some other metric to alter the view of each area might be a solution. For example, I could set the alpha channel based on the total number of approvals, or something similar. In that case, Dubbo would be bright red but quite transparent, whereas South Sydney would be closer to yellow but quite opaque.
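A rough sketch of that colour-plus-alpha idea, for illustration only. The function name `heat_colour`, the cap `max_diff` on the relative change, the volume `max_total` at which an area becomes fully opaque, and the use of the two-month total for opacity are all assumptions, not anything fixed by the data:

```python
import colorsys

def heat_colour(diff, total, max_diff=1.0, max_total=500):
    """Return (r, g, b, alpha): hue from relative change, opacity from volume.

    diff      -- relative change, e.g. 0.24 for +24%
    total     -- number of approvals used to judge how much data backs the change
    max_diff  -- assumed cap: changes beyond +/- this value saturate the colour
    max_total -- assumed volume at which an area becomes fully opaque
    """
    # Clamp the change so that extreme values such as +200% do not go off the scale.
    clamped = max(-max_diff, min(max_diff, diff))
    # Hue in degrees: 240 (blue) at -max_diff, 120 (green) at 0, 0 (red) at +max_diff.
    hue = (120 - 120 * clamped / max_diff) / 360
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    # Low-volume areas become mostly transparent.
    alpha = min(1.0, total / max_total)
    return round(r * 255), round(g * 255), round(b * 255), round(alpha, 2)

# Dubbo: +200% on a total of 4 approvals gives pure red that is almost invisible.
print(heat_colour(2.0, 1 + 3))
# South Sydney: +24% on a total of 957 approvals gives a fully opaque yellow-green.
print(heat_colour(0.24, 427 + 530))
```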

Any ideas on the best way to show this data?

nickf asked Dec 22 '22 08:12


1 Answer

Look into measures of statistical significance. It could be as simple as assuming counting (Poisson) statistics.

In a very simple-minded version, with A_1 the earlier month's count and A_2 the later one, the quantity you plot is

 (A_2 - A_1)/sqrt(A_2 + A_1)

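i.e. the change expressed in units of 1 sigma under simple counting statistics: each count A has a variance of roughly A, so the difference of two counts has a standard deviation of roughly sqrt(A_1 + A_2).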
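A minimal sketch of that calculation, using the June and July counts from the question. The function name `reduced_difference` and the choice to return 0 when both months are empty are assumptions made for the example:

```python
import math

def reduced_difference(old, new):
    """Change between two counts, in units of its estimated standard deviation."""
    total = old + new
    if total == 0:
        # Assumption: no approvals in either month is treated as "no change".
        return 0.0
    return (new - old) / math.sqrt(total)

# (June, July) counts taken from the table in the question.
approvals = {
    "South Sydney": (427, 530),
    "North Sydney": (167, 143),
    "Dubbo": (1, 3),
}

scores = {
    area: reduced_difference(june, july)
    for area, (june, july) in approvals.items()
}

for area, score in scores.items():
    print(f"{area:<14}{score:+.1f}")
```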

Applied to the figures in the question, this gives:

Area            Reduced difference
----------------------------------
South Sydney    +3.3
North Sydney    -1.4
Dubbo           +1.0

This is interpreted as meaning that South Sydney has experienced a significant (i.e. important, and possibly related to a real underlying cause) increase, while North Sydney and Dubbo saw relatively minor changes that may or may not point to a trend. Rules of thumb:

  • 1 sigma changes are just noise
  • 3 sigma changes probably point to an underlying cause (and therefore the expectation of a trend)
  • 5 sigma changes almost certainly point to a trend
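As a sketch, these rules of thumb could be turned into labels for the display. The rules only name the 1, 3 and 5 sigma points, so the band edges used here (everything below 3 sigma counts as noise) and the function name `classify` are assumptions:

```python
def classify(score):
    """Map a sigma score to a rough label; the band edges are an assumption."""
    magnitude = abs(score)  # increases and decreases are treated alike
    if magnitude >= 5:
        return "almost certainly a trend"
    if magnitude >= 3:
        return "probably a real change"
    return "likely noise"

# South Sydney and Dubbo scores from the table above.
for area, score in [("South Sydney", 3.3), ("Dubbo", 1.0)]:
    print(f"{area}: {classify(score)}")
```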

Areas with very low rates (like Dubbo) will still be volatile, but they won't overwhelm the display.

dmckee --- ex-moderator kitten answered Dec 26 '22 11:12