I have a few million datapoints, each with a time and a value. I'm interested in finding all of the sliding windows (i.e., chunks of 4000 datapoints) where the range from high to low of the window exceeds a constant threshold.
For example, assume a window of length 3 and a threshold where high - low > 3. Then the series [10 12 14 13 10 11 16 14 17] would result in [0, 2, 4, 5], because those are the indexes where the 3-period window's high-low range exceeded the threshold.
I have a window size of 4000 and a dataset size of millions.
The naive approach is to just calculate every possible window range, i.e. 1-4000, 2-4001, 3-4002, etc., and accumulate the indexes of the windows that breach the threshold. As you might imagine, this takes forever for large datasets.
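For illustration, a minimal brute-force sketch in Clojure (the name slide-threshold-naive is just for this example):

(defn slide-threshold-naive
  "Brute force: recompute the max and min of every width-w window of v."
  [v w t]
  (for [i (range (inc (- (count v) w)))
        :let [window (subvec v i (+ i w))]
        :when (> (- (apply max window) (apply min window)) t)]
    i))

(slide-threshold-naive [10 12 14 13 10 11 16 14 17] 3 3)
;=> (0 2 4 5)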
So, the algorithm I think would be better is the following:
Calculate the range of the first window (1-4000) and store the indexes of the high and low of that window. Then iterate to 2-4001, 3-4002, etc., and only update the cached high/low index if the NEW value entering on the far right of the window is higher/lower than the old cached value.
Now, let's say the high/low indexes of the 1-4000 window are 333 and 666 respectively. I iterate and continue updating new highs/lows as I see them on the right, but as soon as the window reaches 334-4333 (i.e., as soon as a cached high/low falls outside the current window) I recalculate the high/low for the current window (334-4333), cache it, and continue iterating.
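Roughly, in Clojure (the names slide-threshold-cached, argmax and argmin are hypothetical, just to make the idea concrete):

(defn slide-threshold-cached
  "Cache the indexes of the window high/low and recalculate only when a
  cached index falls out of the window. Worst case is still O(n*k)."
  [v w t]
  (let [n      (count v)
        argmax (fn [lo hi] (apply max-key v (range lo hi)))
        argmin (fn [lo hi] (apply min-key v (range lo hi)))]
    (loop [i 0, hi-i (argmax 0 w), lo-i (argmin 0 w), a []]
      (if (> (+ i w) n)
        a
        (let [hi-i (if (< hi-i i) (argmax i (+ i w)) hi-i) ; cached high left the window: recalc
              lo-i (if (< lo-i i) (argmin i (+ i w)) lo-i) ; cached low left the window: recalc
              a    (if (> (- (v hi-i) (v lo-i)) t) (conj a i) a)
              j    (+ i w)]                                ; index entering on the right of the next window
          (recur (inc i)
                 (if (and (< j n) (>= (v j) (v hi-i))) j hi-i)
                 (if (and (< j n) (<= (v j) (v lo-i))) j lo-i)
                 a))))))

(slide-threshold-cached [10 12 14 13 10 11 16 14 17] 3 3)
;=> [0 2 4 5]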
My questions are:
1.) Is there a mathematical formula for this that eliminates the need for an algorithm at all? I know there are formulas for weighted and exponential moving averages over a window period that don't require recalculation of the window.
2.) Is my algorithm sensible? Accurate? Is there a way it could be greatly simplified or improved?
Thanks a lot.
If the data length is n and window size m, then here's an O(n log m) solution using sorted-maps.
(defn freqs
  "Like frequencies but uses a sorted map"
  [coll]
  (reduce (fn [counts x]
            (assoc counts x (inc (get counts x 0))))
          (sorted-map) coll))

(defn rng
  "Return max - min value of a sorted-map (log time)"
  [smap]
  (- (ffirst (rseq smap)) (ffirst smap)))

(defn slide-threshold
  "Return the start indexes of the width-w windows of v whose max - min exceeds t."
  [v w t]
  (loop [q (freqs (subvec v 0 w)), i 0, j (+ i w), a []]
    (let [a* (if (> (rng q) t) (conj a i) a)]                  ; test the current window [i, j)
      (if (= (count v) j)
        a*                                                      ; the final window has been tested
        (let [q* (merge-with + q {(v i) -1} {(v j) 1})          ; slide right: drop (v i), add (v j)
              q* (if (zero? (q* (v i))) (dissoc q* (v i)) q*)]  ; drop keys whose count hits zero
          (recur q* (inc i) (inc j) a*))))))
(slide-threshold [10 12 14 13 10 11 16 14 17] 3 3)
;=> [0 2 4 5]
The naive version is not linear; linear would be O(n). The naive algorithm is O(n*k), where k is the window size. Your improvement is also O(n*k) in the worst case (imagine an array sorted in ascending order: the cached low falls out of the window at every step, forcing a full recalculation each time), but in the general case you should see a big improvement in running time because you avoid a large number of recalculations.
You can solve this in O(n log k) by using a Min-max heap (or two heaps), but you have to use a type of heap that can remove an arbitrary node in O(log k). You can't use a standard binary heap because although removing an arbitrary node is O(log k), finding the node is O(k).
Assuming you have a Min-max heap, the algorithm looks like this:
heap = create empty heap
add first k items to the heap            // the heap now holds the first window [0, k)
for (i = k; i < n; ++i)
{
    if (heap.MaxItem - heap.MinItem > threshold)
        output the window starting at i-k
    remove item i-k from the heap        // slide the window one position to the right
    add item i to the heap
}
if (heap.MaxItem - heap.MinItem > threshold)   // the heap now holds the last window [n-k, n)
    output the window starting at n-k
The problem, of course, is removing item i-k from the heap. Actually, the problem is finding it efficiently. The way I've done this in the past is to modify my binary heap so that it stores nodes that contain an index and a value. The heap comparisons use the value, of course. The index is the node's position in the backing array, and is updated by the heap whenever the node is moved. When an item is added to the heap, the Add method returns a reference to the node, which I maintain in an array. Or in your case you can maintain it in a queue.
So the algorithm looks like this:
queue = create empty queue of heap nodes
heap = create empty heap
for (i = 0; i < k; ++i)
{
    node = heap.Add(array[i])
    queue.Add(node)
}
for (i = k; i < n; ++i)
{
    if (heap.MaxItem - heap.MinItem > threshold)
        output the window starting at i-k
    node = queue.Dequeue()                        // the node for array[i-k], the item leaving the window
    remove item at position node.Index from the heap
    node = heap.Add(array[i])
    queue.Add(node)
}
if (heap.MaxItem - heap.MinItem > threshold)      // the heap now holds the last window [n-k, n)
    output the window starting at n-k
This is provably O(n log k). Every item is read, added to the heap once, and eventually removed from the heap once, and each of those heap operations is O(log k). In addition, every item is added to the queue and removed from the queue, but those two operations are O(1).
For those of you who doubt me, it is possible to remove an arbitrary element from a heap in O(log k) time, provided that you know where it is. I explained the technique here: https://stackoverflow.com/a/8706363/56778.
So, with a window size of 4,000, running time will be roughly proportional to 3n * 2 log2(k). Given a million items and a window size of 4,000, that works out to 3,000,000 * (12 * 2), or about 72 million operations. That's roughly what your optimized naive method would cost if it had to recompute the full window about 18,000 times.
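A quick back-of-envelope check of that estimate in Clojure (n and k taken from the numbers above):

(let [n 1e6, k 4000]
  (* 3 n 2 (/ (Math/log k) (Math/log 2))))   ; 3n * 2 log2(k)
;=> ~7.2E7  (about 72 million)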
As I said, your optimized method can end up taking a long time if the array is sorted, or nearly so. The heap algorithm I outlined above doesn't suffer from that.
You should give your "better" algorithm a try and see if it's fast enough. If it is, and you don't expect pathological data, then great. Otherwise take a look at this technique.