Program is receiving approximately 50,000 numbers every second.
At ANY given moment, I need to calculate minimum, maximum and average of the values (numbers) that arrived in the last second (regarding to given moment).
Is there a way to do this without using array or list (buffer) to store arriving numbers and to calculate results?
If I need to use buffer, what would be the efficient way to achieve this?
(Note that numbers from buffer also must be efficiently removed from time to time)
Keep a minimum variable that is initialized to a high value, and update it if you see a lower value. Do the opposite with a maximum variable. Add up all numbers and divide that sum by the total count to get the average.
The Average function calculates the average, or arithmetic mean, of its arguments. The Max function finds the maximum value. The Min function finds the minimum value. The Sum function calculates the sum of its arguments.
Min and max: Shows you the lowest (minimum) and highest (maximum) values in your column. Mean: Also called the average. The sum of all the values in your column, divided them by the total number of values. Median: The number that would be in the middle of an ordered list of your values.
Here is an algorithm that will somewhat work to save efficiency in certain cases:
As events come in, buffer them completely, and calculate a running sum
, count
, min
, max
(trivial).
When a request for average
, min
, or max
is made, loop through from the back of the buffer and start removing values older than one second. Subtract from sum
and count
as you go.
If the values are all above min
you can keep your min
. If the values are below max
, you can keep your max
. In this scenario, you have average
, min
, and max
updated efficiently.
If the values are below min
or above max
, you'll need to loop through the rest of the array and recalculate it.
Do step two once a second or so also so that the buffer doesn't get too full. This code could be performed on every buffer insert also, or wherever made sense.
Best structure for this kind of work is a circular buffer, to avoid memory allocations and GC getting in the way. It should be large enough to cover the worst case scenario for message size per second.
Updates
Depending on the usage scenario one other thing to do would be to run the algorithm above but in 10 x 100ms chunks rather than 1 x 1000ms piece. That is, keep the running min, max, sum and count on those 10 chunks. Then when you reach an 'invalidation' scenario, you generally only need to look through the latest 100ms of data or a quick pass through the min and max of the other 9 chunks.
@ja72 provided a great idea to save on finding the min and max values if they are invalidated:
Instead of keeping the min/max values x_min, x_max keep instead the index of where they are located in the x[i] array with i_min and i_max. Then finding them can be trivial sometimes, but when the last value considered holds the min and max, the entire list needs to be scanned to establish the new limits.
Sam Holder had another good idea in the comments - keep a parallel array that is always sorted, this lets you lop numbers off the top or bottom to find new minimums and maximums easier. However, insert speed here is compromised a little (needs to remain in order).
Ultimately, the right choice will depend on the usage characteristics of the program. How often will values be read vs how often they are inserted?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With