Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

30,000 data points, find greatest change over 2 weeks' time

Tags:

algorithm

I have:

- 30,000 data points
- each data point is a measurement of type float
- each measurement is associated with a date
- each date has only one measurement
- no dates are without measurements
- the data comes in the form of a text file: 30,000 lines in this form:
    - YYYY-MM-DD I,F (e.g. 1977-02-08 20.74)
- measurement appearing in the source file are already sorted by date

I need:

- a time-interval T with boundaries (s,e) /* start, end */
- (s - e = 14 days) the time-interval *must* be 2 weeks
- define min as the lowest value in the interval T
- define max as the greatest value in the interval T
- the chosen T needs to have the greatest distance btwn max and min of all possible Ts
- break ties among intervals T by choosing the most recent (with the greatest s value)
- the chosen T must consider all jumps in the 14 days, not just the values @ s and e
- if the overall "variance" in the interval is great but the jump 
  |max-min| is not the greatest in absolute value, T is not the right choice,
  even if it's an "exciting" interval

I am asking:

- which algorithm to employ, considering algorithms are not my specialty
- which data structure to use to keep track of the subtotals

Note:

- an answer in pseudo code would be preferred, "prose" is fine if pressured for time
- an answer in Python would be... splendid :)

If you want, you can generate "dummy" data and run your proposed algorithm as a test or I could share the actual data.

I am not concerned with performance so much here apart from wanting to know the fastest way to do this so as to learn how to apply the right solution and the correct algorithm.

I think I can "prove" correctness with even the simplest iterative algorithm because the dataset is small given today's computers.

So far, I am "traversing and carrying along 14 vectors of 14 measurements", if you could teach me how to do this incrementally with sub-sums, that would be really appreciated.

like image 638
Robottinosino Avatar asked Jun 15 '12 02:06

Robottinosino


People also ask

How much has the market dropped in 2022?

The tech-heavy Nasdaq 100 has dropped nearly 33 percent so far in 2022, the Dow Jones Industrial Average lost more than 20 percent while the world's best-known cryptocurrency, Bitcoin, shed nearly 60 percent of its value.

What is hybrid work Microsoft?

​​​​​​​At Microsoft, we value and support flexibility as part of our hybrid workplace where every employee can do their best work by working the way they work best. A hybrid workplace is a mix of workstyles across three dimensions: work site, work location, and work hours.

What is hybrid work model?

A hybrid workplace model mixes in-office and remote work to offer flexibility and support to employees. In a hybrid workplace, employees typically enjoy more autonomy and better work-life balance – and are more engaged as a result.


2 Answers

Sliding windows do actually work here, by keeping two stacks (perhaps this is a little misleading, as this is probably best implemented as a doubly-ended queue). Keep a stack minstack and a stack called maxstack. The crux of the algorithm is that minstack should be strictly non-decreasing and maxstack should be strictly non-increasing at all points of the slide. So, how do we do that?

First, add the first 14 points to a stack. Let's define add(point) as:

Do this for the minstack:

  • While the point is smaller than the top element of minstack, remove the top element of minstack.
  • Add the point to minstack.

Similarly, for the maxstack:

  • While the new point is larger than the top element of maxstack, remove the top element of maxstack.
  • Add the point to maxstack.

Due to the property above, the min and max of the first 14 elements should be the bottom elements of minstack and maxstack. Now slide the window. We simply have to note that if the left point is still "alive" in any of the stacks, it's necessarily now the bottom point. Therefore this should be easy, it's simply:

slide():
    add(new_point)
    if (left_point == bottom(minstack)) remove_bottom(minstack)
    if (left_point == bottom(maxstack)) remove_bottom(maxstack)

Do this until your points are exhausted. The interval you're looking for is the one in which bottom(maxstack) - bottom(minstack) was largest.

Note that any point enters minstack/maxstack at most once, and every point leaves the stacks at most once as well, therefore this does at most 4 operations for each point, no matter what the size of the desired interval is.

EDIT: I just noticed you wanted an implementation in Python. I didn't really want to parse the data, so the function takes a list of values as input, and outputs the indices (s,e) in that array:

import collections

def add(x, minstack, maxstack):
    while minstack and x < minstack[-1]: minstack.pop()
    while maxstack and x > maxstack[-1]: maxstack.pop()
    minstack.append(x)
    maxstack.append(x)

def get_largest_interval(points):
    minstack = collections.deque()
    maxstack = collections.deque()

    best_diff = -1
    best_interval = None

    for index, elem in enumerate(points):
        add(elem,minstack,maxstack)
        if index >= 14:
            if minstack[0] == points[index-14]: minstack.popleft()
            if maxstack[0] == points[index-14]: maxstack.popleft()

        if index >= 13:
            this_diff = maxstack[0]-minstack[0]
            if best_diff == -1 or this_diff >= best_diff:
                best_interval = (index-13, index)
                best_diff = this_diff

    return best_interval


print get_largest_interval([0, 2, 2,2,2,2,2,2,2,2,2,2,2,2,3])
like image 118
ffao Avatar answered Sep 24 '22 23:09

ffao


If I understand you, you have:

30,000 distinct, ordered data values. The ordering happens to be by date, but that is not relevant.

Within this set, there are 29,986 subsets in which the contents are an ordered sequence that start at one data point and contain that initial point and thirteen following datapoints.


Taking it very slowly:

1) read your 30,000 data points into an array of size 30,000.

2) allocate an array of size 29,986. Call this array "Potential Winners".

3) fill the Potential Winners array by scanning each 14-point subset, temporarily holding the max value and the min value encountered in the subset. When those two values are in hand, save (Max-Min) at the index location --of the starting point-- within Potential Winners. Do not try any sliding windows optimizations; see below.

4) Do a linear scan of Potential Winners, saving the value and (importantly) the index at which it is located.

BTW: what do you do if there is no single winner? If all the datapoints have the same value, you will get 29,986 candidate winners, all with the same value.

5) Optimization: do not allocate and fill Potential Winners; initialize Current Winner to the tuple (value, index) as (0, -1). Compute the value of each 14-point subset as above, but retain only the better value among {Current Winner, "the value I get from this current subset"}

6) Sliding windows: I haven't thought this through, but I think that maintaining a sliding window is more work than the simple linear pass described above.

The reason: okay, compute the value of the first 14 points; get a min and max, and get the interval between them. But wait, we need the min and max values to use in the next window. Now slide the window one position up. The value at the left end is gone; but was it the min, the max, or in between? Suppose it was the min, and it is now gone. What value is the second lowest min? We don't have that info.

In order to keep a sliding window, you need to sort each 14-datapoint subsequence and remember the index position of all values. Then, when you slide, you can know if the value that dropped out at the left was either the old min or the old max, and whether the new value that came in on the right is either the new min or the new max. But it's not worth the effort.

(This situation reminds a little bit of the Boyer-Moore fast substring-finding algorithm. I don't remember the details, but it involves preprocessing the whole input and keeping a table of the locations where each value occurs. But this is way off-topic)



Hope this helps...

like image 29
Bill Torcaso Avatar answered Sep 24 '22 23:09

Bill Torcaso