I'm creating an iterative algorithm (Monte Carlo method). The algorithm returns a value on every iteration, creating a stream of values.
I need to analyze these values and stop the algorithm when, say, the last 1000 returned values are within some `epsilon` of each other.
I decided to implement it by calculating the `max` and `min` values of the last 1000 values, computing the error as `(max - min) / min`, and comparing it to epsilon: `error <= epsilon`. If this condition is reached, stop the iterations and return the result.
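For concreteness, here is a minimal sketch of that criterion, assuming the returned values are positive so that dividing by the minimum is well defined:

```python
def within_epsilon(values, epsilon):
    """Stopping criterion described above: relative spread of the window.
    Assumes the values are positive, so dividing by min() is safe."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo <= epsilon
```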
My first, hare-brained idea was to use a list, `append` new values to it, and calculate the `max` and `min` of its last 1000 values after each append.
Then I realized there is no point in keeping more than the last 1000 values, so I remembered `collections.deque`. That seemed like a very good idea, since adding and deleting at either end of a `deque` is O(1). But it didn't solve the problem of having to scan all of the last 1000 values on each iteration to compute `min` and `max`.
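For reference, a minimal sketch of that deque version, which makes the remaining bottleneck visible (`sample` is a hypothetical stand-in for one Monte Carlo iteration):

```python
from collections import deque

def run_until_stable(sample, epsilon=1e-6, window_size=1000):
    """Keep the last `window_size` values; appends and evictions are O(1),
    but min()/max() still scan the whole window on every iteration."""
    window = deque(maxlen=window_size)   # oldest value falls off automatically
    while True:
        value = sample()                 # one Monte Carlo iteration (assumed)
        window.append(value)
        if len(window) == window_size:
            lo, hi = min(window), max(window)   # O(window_size) per iteration
            if (hi - lo) / lo <= epsilon:
                return value             # latest estimate as the result
```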
Then I remembered the `heapq` module. It keeps the data organized so that the smallest element can be retrieved efficiently at any moment. But I need both the smallest and the largest. Furthermore, I need to preserve insertion order so that I can keep the last 1000 values returned by the algorithm, and I don't see how to achieve that with `heapq`.
With all those thoughts in mind, I decided to ask here: how can I solve this task most efficiently?
If you are free/willing to change your definition of error, you might want to consider using the variance instead of `(max - min) / min`.
You can compute the variance incrementally.
True, using this method you are not deleting any values from your stream -- the variance will depend on all the values. But so what? With enough values, the first few won't matter a great deal to the variance, and the variance of the average, `variance / n`, will become small when enough values cluster around some fixed value.

So, you can choose to halt when `variance / n < epsilon`.
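Here is a minimal sketch of that idea using Welford's online algorithm, a standard way to update the mean and variance incrementally in O(1) per value (`sample` is a hypothetical stand-in for one Monte Carlo iteration):

```python
def run_until_variance_small(sample, epsilon=1e-6):
    """Welford's online algorithm: maintain a running mean and sum of
    squared deviations, then stop when variance / n drops below epsilon."""
    n, mean, m2 = 0, 0.0, 0.0
    while True:
        x = sample()
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)       # running sum of squared deviations
        if n > 1:
            variance = m2 / (n - 1)    # sample variance over all n values
            if variance / n < epsilon:
                return mean
```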
As a refinement of @unutbu's excellent idea, you could consider using an exponentially-weighted moving variance. It can be computed in O(1) time per observation, requires O(1) space, and has the advantage of automatically reducing an observation's weight as it gets older.
The following paper has the relevant formulae: link. See equations (140)-(143) therein.
Finally, you might want to work with the standard deviation instead of variance. It is simply the square root of variance, and has the advantage of having the same units as the original data. This should make it easier to formulate a meaningful stopping criterion.
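As a sketch of how this might look, using the incremental update form for an exponentially-weighted mean and variance (`sample` is again a hypothetical stand-in for one iteration, and the smoothing factor `alpha` is an assumed parameter; smaller `alpha` weights old observations less aggressively, behaving roughly like a window of ~1/alpha values):

```python
import math

def run_until_ew_std_small(sample, alpha=0.001, epsilon=1e-6):
    """Exponentially-weighted moving mean/variance in O(1) time and space
    per observation; stop when the EW standard deviation (same units as
    the data) falls below epsilon."""
    mean = sample()                     # seed the mean with the first value
    var = 0.0
    while True:
        x = sample()
        diff = x - mean
        incr = alpha * diff
        mean += incr                                # EW mean update
        var = (1.0 - alpha) * (var + diff * incr)   # EW variance update
        if math.sqrt(var) < epsilon:
            return mean
```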