
What statistics can be maintained for a set of numerical data without iterating?

Update

Just for future reference, I'm going to list all of the statistics that I'm aware of that can be maintained in a rolling collection, recalculated as an O(1) operation on every addition/removal (this is really how I should've worded the question from the beginning):

Obvious

  • Count
  • Sum
  • Mean
  • Max*
  • Min*
  • Median**

Less Obvious

  • Variance
  • Standard Deviation (see the sketch after the footnotes below)
  • Skewness
  • Kurtosis
  • Mode***
  • Weighted Average
  • Weighted Moving Average****

OK, so to put it more accurately: these are not "all" of the statistics I'm aware of. They're just the ones that I can remember off the top of my head right now.

*Can be recalculated in O(1) for additions only, or for additions and removals if the collection is sorted (but in this case, insertion is not O(1)). Removals potentially incur an O(n) recalculation for non-sorted collections.

**Recalculated in O(1) for a sorted, indexed collection only.

***Requires a fairly complex data structure to recalculate in O(1).

****This can certainly be achieved in O(1) for additions and removals when the weights are assigned in a linearly descending fashion. In other scenarios, I'm not sure.
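
To make the bookkeeping concrete, here is a rough sketch (C#, untested; the class and member names are mine, not from any library) of how count, sum, mean, variance, and standard deviation can all be maintained with O(1) work per addition or removal, by keeping a running sum and a running sum of squares. Adding sums of cubes and fourth powers extends the same idea to skewness and kurtosis. Note that the sum-of-squares formula can lose precision through floating-point cancellation on large or widely spread data, so treat this as an illustration of the bookkeeping rather than a production implementation.

using System;

class RunningStats {
    private int count = 0;
    private double sum = 0.0;    // running sum of the values
    private double sumSq = 0.0;  // running sum of the squared values

    public void Add(double value) {
        count++;
        sum += value;
        sumSq += value * value;
    }

    public void Remove(double value) {
        count--;
        sum -= value;
        sumSq -= value * value;
    }

    public int Count { get { return count; } }
    public double Sum { get { return sum; } }
    public double Mean { get { return sum / count; } }

    // Population variance via E[x^2] - (E[x])^2; beware of cancellation.
    public double Variance {
        get {
            double mean = sum / count;
            return sumSq / count - mean * mean;
        }
    }

    public double StandardDeviation { get { return Math.Sqrt(Variance); } }
}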


Original Question

Say I maintain a collection of numerical data -- let's say, just a bunch of numbers. For this data, there are loads of calculated values that might be of interest; one example would be the sum. To get the sum of all this data, I could...

Option 1: Iterate through the collection, adding all the values:

double sum = 0.0;
for (int i = 0; i < values.Count; i++) sum += values[i];

Option 2: Maintain the sum, eliminating the need to ever iterate over the collection just to find the sum:

void Add(double value) {
    values.Add(value);
    sum += value;
}

void Remove(double value) {
    values.Remove(value);
    sum -= value;
}

EDIT: To put this question in more relatable terms, let's compare the two options above to a (sort of) real-world situation:

Suppose I start listing numbers out loud and ask you to keep them in your head. I start by saying, "11, 16, 13, 12." If you've just been remembering the numbers themselves and nothing more, and then I say, "What's the sum?", you'd have to think to yourself, "OK, what's 11 + 16 + 13 + 12?" before responding, "52." If, on the other hand, you had been keeping track of the sum yourself while I was listing the numbers (i.e., when I said "11" you thought "11," when I said "16" you thought "27," and so on), you could answer "52" right away. Then if I say, "OK, now forget the number 16," and you've been keeping track of the sum inside your head, you can simply take 16 away from 52 and know that the new sum is 36, rather than taking 16 off the list and then summing up 11 + 13 + 12.

So my question is, what other calculations, other than the obvious ones like sum and average, are like this?


SECOND EDIT: As an arbitrary example of a statistic that (I'm almost certain) does require iteration -- and therefore cannot be maintained as simply as a sum or average -- consider if I asked you, "how many numbers in this collection are divisible by the min?" Let's say the numbers are 5, 15, 19, 20, 21, 25, and 30. The min of this set is 5, which divides into 5, 15, 20, 25, and 30 (but not 19 or 21), so the answer is 5. Now if I remove 5 from the collection and ask the same question, the answer is now 2, since only 15 and 30 are divisible by the new min of 15; but, as far as I can tell, you cannot know this without going through the collection again.
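
Just to pin the example down in code, here's a naive sketch (the method name and the use of LINQ are mine, purely for illustration). Even if the min is maintained separately, a removal that changes the min still forces the counting pass:

using System.Collections.Generic;
using System.Linq;

static class MinDivisibility {
    // Assumes a non-empty collection. Finding the new min can be O(1) if it is
    // maintained alongside the data, but the count itself still needs a full pass.
    public static int CountDivisibleByMin(List<int> values) {
        int min = values.Min();
        return values.Count(v => v % min == 0);
    }
}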

So I think this gets to the heart of my question: if we can divide statistics into these two categories -- those that are maintainable (my own term; maybe there's a more official one somewhere) versus those that require iterating over the collection every time it changes -- what are all the maintainable ones?

What I am asking about is not strictly the same as an online algorithm (though I sincerely thank those of you who introduced me to that concept). An online algorithm can begin its work without having even seen all of the input data; the maintainable statistics I am seeking will certainly have seen all the data, they just don't need to iterate through it again every time it changes.

asked Oct 15 '09 by Dan Tao


2 Answers

First, the term that you want here is online algorithm. All moments (mean, standard deviation, skewness, etc.) can be calculated online. Others include the minimum and maximum. Note that the median and mode cannot be calculated online.
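
For what it's worth, the usual way to compute the mean and variance online (additions only) is Welford's recurrence; a rough C# sketch follows (class and member names are illustrative, not taken from the answer). Analogous one-pass updates exist for skewness and kurtosis.

class OnlineMoments {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    public void Add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);  // note: uses the updated mean
    }

    public double Mean { get { return mean; } }
    public double Variance { get { return n > 0 ? m2 / n : 0.0; } }  // population variance
}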

answered Oct 26 '22 by jason


To consistently maintain the high/low, store your data in sorted order. There are algorithms for maintaining data structures that preserve ordering.

Median is trivial if the data is ordered.
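
A minimal sketch of the idea (mine, not from the answer), using List<T>.BinarySearch and Insert. Note that inserting into a flat list is O(n) because of element shifting, which is why the answer points at data structures that preserve ordering (e.g., balanced trees):

using System.Collections.Generic;

class SortedStats {
    private readonly List<double> sorted = new List<double>();

    public void Add(double value) {
        int i = sorted.BinarySearch(value);
        if (i < 0) i = ~i;           // complement of the result gives the insertion point
        sorted.Insert(i, value);     // O(n): elements after i are shifted
    }

    public void Remove(double value) {
        int i = sorted.BinarySearch(value);
        if (i >= 0) sorted.RemoveAt(i);
    }

    public double Min { get { return sorted[0]; } }
    public double Max { get { return sorted[sorted.Count - 1]; } }

    public double Median {
        get {
            int n = sorted.Count;
            return n % 2 == 1
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        }
    }
}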

If the data is reduced slightly to a frequency table, you can maintain mode. If you keep your data as a random, flat list of values, you can't easily compute mode in the presence of change.
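
A frequency table handles additions in O(1), since only the value just added can become the new mode. Removals are the hard part: if the current mode is removed, some rescan (or the more elaborate structure the question's update alludes to) is needed. A rough sketch along those lines, with names of my own invention:

using System.Collections.Generic;

class ModeTracker {
    private readonly Dictionary<int, int> freq = new Dictionary<int, int>();
    private int mode;
    private int modeCount = 0;

    public void Add(int value) {
        int c;
        freq.TryGetValue(value, out c);
        freq[value] = c + 1;
        if (c + 1 > modeCount) {  // O(1): only 'value' can overtake the current mode
            mode = value;
            modeCount = c + 1;
        }
    }

    public void Remove(int value) {
        freq[value]--;
        if (freq[value] == 0) freq.Remove(value);
        if (value == mode) {
            // The old mode may no longer be the most frequent value;
            // fall back to a scan over the distinct values.
            modeCount = 0;
            foreach (var kv in freq) {
                if (kv.Value > modeCount) { mode = kv.Key; modeCount = kv.Value; }
            }
        }
    }

    public int Mode { get { return mode; } }
}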

answered Oct 26 '22 by S.Lott