30,000 data points, find greatest change over 2 weeks' time

Tags:

algorithm

I have:

- 30,000 data points
- each data point is a measurement of type float
- each measurement is associated with a date
- each date has only one measurement
- no dates are without measurements
- the data comes in the form of a text file: 30,000 lines in this form:
    - YYYY-MM-DD I,F (e.g. 1977-02-08 20.74)
- measurement appearing in the source file are already sorted by date

I need:

- a time-interval T with boundaries (s,e) /* start, end */
- (s - e = 14 days) the time-interval *must* be 2 weeks
- define min as the lowest value in the interval T
- define max as the greatest value in the interval T
- the chosen T needs to have the greatest distance btwn max and min of all possible Ts
- break ties among intervals T by choosing the most recent (with the greatest s value)
- the chosen T must consider all jumps in the 14 days, not just the values @ s and e
- if the overall "variance" in the interval is great but the jump 
  |max-min| is not the greatest in absolute value, T is not the right choice,
  even if it's an "exciting" interval

I am asking:

- which algorithm to employ, considering algorithms are not my specialty
- which data structure to use to keep track of the subtotals

Note:

- an answer in pseudo code would be preferred, "prose" is fine if pressured for time
- an answer in Python would be... splendid :)

If you want, you can generate "dummy" data and run your proposed algorithm as a test or I could share the actual data.

I am not concerned with performance so much here apart from wanting to know the fastest way to do this so as to learn how to apply the right solution and the correct algorithm.

I think I can "prove" correctness with even the simplest iterative algorithm because the dataset is small given today's computers.

So far, I am "traversing and carrying along 14 vectors of 14 measurements", if you could teach me how to do this incrementally with sub-sums, that would be really appreciated.

638

asked Jun 15 '12 02:06

2 Answers

Sliding windows do actually work here, by keeping two stacks (perhaps this is a little misleading, as this is probably best implemented as a doubly-ended queue). Keep a stack minstack and a stack called maxstack. The crux of the algorithm is that minstack should be strictly non-decreasing and maxstack should be strictly non-increasing at all points of the slide. So, how do we do that?

First, add the first 14 points to a stack. Let's define add(point) as:

Do this for the minstack:

While the point is smaller than the top element of minstack, remove the top element of minstack.
Add the point to minstack.

Similarly, for the maxstack:

While the new point is larger than the top element of maxstack, remove the top element of maxstack.
Add the point to maxstack.

Due to the property above, the min and max of the first 14 elements should be the bottom elements of minstack and maxstack. Now slide the window. We simply have to note that if the left point is still "alive" in any of the stacks, it's necessarily now the bottom point. Therefore this should be easy, it's simply:

slide():
    add(new_point)
    if (left_point == bottom(minstack)) remove_bottom(minstack)
    if (left_point == bottom(maxstack)) remove_bottom(maxstack)

Do this until your points are exhausted. The interval you're looking for is the one in which bottom(maxstack) - bottom(minstack) was largest.

Note that any point enters minstack/maxstack at most once, and every point leaves the stacks at most once as well, therefore this does at most 4 operations for each point, no matter what the size of the desired interval is.

EDIT: I just noticed you wanted an implementation in Python. I didn't really want to parse the data, so the function takes a list of values as input, and outputs the indices (s,e) in that array:

import collections

def add(x, minstack, maxstack):
    while minstack and x < minstack[-1]: minstack.pop()
    while maxstack and x > maxstack[-1]: maxstack.pop()
    minstack.append(x)
    maxstack.append(x)

def get_largest_interval(points):
    minstack = collections.deque()
    maxstack = collections.deque()

    best_diff = -1
    best_interval = None

    for index, elem in enumerate(points):
        add(elem,minstack,maxstack)
        if index >= 14:
            if minstack[0] == points[index-14]: minstack.popleft()
            if maxstack[0] == points[index-14]: maxstack.popleft()

        if index >= 13:
            this_diff = maxstack[0]-minstack[0]
            if best_diff == -1 or this_diff >= best_diff:
                best_interval = (index-13, index)
                best_diff = this_diff

    return best_interval


print get_largest_interval([0, 2, 2,2,2,2,2,2,2,2,2,2,2,2,3])

118

answered Sep 24 '22 23:09

ffao

If I understand you, you have:

30,000 distinct, ordered data values. The ordering happens to be by date, but that is not relevant.

Within this set, there are 29,986 subsets in which the contents are an ordered sequence that start at one data point and contain that initial point and thirteen following datapoints.

Taking it very slowly:

1) read your 30,000 data points into an array of size 30,000.

2) allocate an array of size 29,986. Call this array "Potential Winners".

3) fill the Potential Winners array by scanning each 14-point subset, temporarily holding the max value and the min value encountered in the subset. When those two values are in hand, save (Max-Min) at the index location --of the starting point-- within Potential Winners. Do not try any sliding windows optimizations; see below.

4) Do a linear scan of Potential Winners, saving the value and (importantly) the index at which it is located.

BTW: what do you do if there is no single winner? If all the datapoints have the same value, you will get 29,986 candidate winners, all with the same value.

5) Optimization: do not allocate and fill Potential Winners; initialize Current Winner to the tuple (value, index) as (0, -1). Compute the value of each 14-point subset as above, but retain only the better value among {Current Winner, "the value I get from this current subset"}

6) Sliding windows: I haven't thought this through, but I think that maintaining a sliding window is more work than the simple linear pass described above.

The reason: okay, compute the value of the first 14 points; get a min and max, and get the interval between them. But wait, we need the min and max values to use in the next window. Now slide the window one position up. The value at the left end is gone; but was it the min, the max, or in between? Suppose it was the min, and it is now gone. What value is the second lowest min? We don't have that info.

In order to keep a sliding window, you need to sort each 14-datapoint subsequence and remember the index position of all values. Then, when you slide, you can know if the value that dropped out at the left was either the old min or the old max, and whether the new value that came in on the right is either the new min or the new max. But it's not worth the effort.

(This situation reminds a little bit of the Boyer-Moore fast substring-finding algorithm. I don't remember the details, but it involves preprocessing the whole input and keeping a table of the locations where each value occurs. But this is way off-topic)

Hope this helps...

answered Sep 24 '22 23:09

Bill Torcaso

Related questions
                            
                                Looking for an optimal online assignment algorithm
                            
                                Approximate searching against a list of strings
                            
                                Multidimensional polynomial regression (preferably C/C++, Java or Scala)
                            
                                Finding common outline of multiple polygons
                            
                                Ant Colony Optimization or Genetic Algorithm for percentage based problem
                            
                                How to compute value of i-th digit in a k-ary representation of a number?
                            
                                Efficiently List All Sub-Directories in a Directory
                            
                                Polygon Clipping: Only "viewable" area
                            
                                looking for efficient algorithm (non-trivial) [closed]
                            
                                Unlucky Numbers
                            
                                Counting the number of shortest paths through a node in a DAG
                            
                                Why is insertion sort faster than quick-sort and bubble-sort for small cases?
                            
                                Number of permutation of a particular string is divisible by a number
                            
                                canvas "random" curved shapes
                            
                                Ensuring only a single formula could ever apply to 4 random elements
                            
                                Incrementally detecting dominators in DAGs
                            
                                Maximal SHA-1 Hash Performance Tips in Java
                            
                                Bidirectional spanning tree
                            
                                An algorithm for checking if a nonlinear function f is always positive
                            
                                Vertex-Coloring/Assignment to minimize the number of "color crossings"

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With