 

Can I make an O(1) search algorithm using a sorted array with a known step?

Background

My software visualizes very large datasets; the data is so large that I can't hold it all in RAM at once, so it has to be loaded in a paged fashion. I embed matplotlib functionality for displaying and manipulating the plot in the backend of my application.

These datasets contain three internal lists I use for visualization: time, height, and dataset. My program plots the data as time x height, and users additionally have the option of drawing shapes around regions of the graph that can be extracted to a completely different plot.

The difficult part is that when I want to extract the data from the shapes, the shape vertices are real coordinates computed by the plot, not rounded to the nearest point in my time array. Here's an example of a shape which bounds a region in my program:

[figure: a shape drawn around a region of the plot]

While X1 may represent the coordinate (2007-06-12 03:42:20.070901+00:00, 5.2345) according to matplotlib, the closest coordinate actually present in time and height might be something like (2007-06-12 03:42:20.070801+00:00, 5.219), only a small bit off from matplotlib's coordinate.


The Problem

So given some arbitrary value, let's say x1 = 732839.154395 (a numeric representation of the date), and a list of similar values with a constant step:

732839.154392
732839.154392
732839.154393
732839.154393
732839.154394
732839.154394
732839.154395
732839.154396 
732839.154396
732839.154397
732839.154397
732839.154398
732839.154398
732839.154399
etc...

What would be the most efficient way of finding the closest representation of that point? I could simply loop through the list and grab the value with the smallest difference, but the size of time is huge. Since I know the array is 1. sorted and 2. increments with a constant step, I was thinking this problem should be solvable in O(1) time. Is there a known algorithm that solves this kind of problem? Or would I simply need to devise some custom algorithm? Here is my current thought process:

1. grab the first and second elements of time
2. subtract the first element from the second to obtain the step
3. subtract the first element of time from the bounding x value to obtain the difference
4. divide the difference by the step to obtain the index
5. move time forward to that index
6. check the elements surrounding the index to ensure the closest representation
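The steps above can be sketched in a few lines (a minimal illustration, not production code; `times` here is a hypothetical evenly stepped list):

```python
def nearest_index(times, x):
    """Estimate the index of the value in `times` closest to `x`,
    assuming `times` is sorted and (roughly) evenly stepped."""
    step = times[1] - times[0]                  # step from the first two elements
    guess = int(round((x - times[0]) / step))   # jump straight to the estimated index
    guess = max(0, min(guess, len(times) - 1))  # clamp to a valid index
    # check the surrounding elements to be sure we return the closest one
    lo, hi = max(0, guess - 2), min(len(times), guess + 3)
    return min(range(lo, hi), key=lambda i: abs(times[i] - x))

times = [732839.154392 + i * 1e-6 for i in range(10)]
print(nearest_index(times, 732839.154395))  # → 3
```

Everything here is O(1) except the constant-size neighborhood check at the end.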
Syntactic Fructose asked Jul 15 '15 13:07



2 Answers

The algorithm you suggest seems reasonable and should work.

As has become clear in your comments, the problem with it is the coarseness at which your time values were recorded. (This is common when unsynchronized data is recorded, i.e., the data-generation clock, e.g. the frame rate, is not synced with the computer's clock.)

The easy way around this is to read two points separated by a larger time span: for example, read the first time value and then the 1000th. Everything else stays the same in your calculation, but you get your timestep by subtracting the two and then dividing by 1000.

Here's a test that builds data similar to yours:

start = 97523.29783
increment = .000378912098
target = 97585.23452

# build a timeline of coarsely recorded times
times = []
time = start
actual_index = None
for i in range(1000000):
    trunc = float(str(time)[:10])  # truncate the time value to simulate coarse recording
    times.append(trunc)
    if actual_index is None and time > target:
        actual_index = i
    time = time + increment

# now test: estimate the timestep over spans of different sizes
intervals = [1, 2, 5, 10, 100, 1000, 10000]

for i in intervals:
    dt = (times[i] - times[0]) / i
    index = int((target - start) / dt)
    print("    %6i  %8i  %8i  %.10f" % (i, actual_index, index, dt))

Result:

  span    actual     guess  est dt (actual=.000378912098)
     1    163460    154841  0.0004000000
     2    163460    176961  0.0003500000
     5    163460    162991  0.0003800000
    10    163460    162991  0.0003800000
   100    163460    163421  0.0003790000
  1000    163460    163464  0.0003789000
 10000    163460    163460  0.0003789100

That is, as the space between the sampled points gets larger, the time-interval estimate gets more accurate (compare it to increment in the program) and the estimated index (3rd column) gets closer to the actual index (2nd column). Note that the accuracy of the dt estimate is basically proportional to the number of digits in the span. The best you could do is use the times at the start and end points, but it seemed from your question statement that this would be difficult; if it's not, it will give the most accurate estimate of your time interval. Note that here, for clarity, I exaggerated the lack of accuracy by making my time recording very coarse, but in general, every power of 10 in your span increases your accuracy by the same amount.

As an example of that last point, if I reduce the coarseness of the time values by changing the truncation line to trunc = float(str(time)[:12]), I get:

  span    actual     guess  est dt (actual=.000378912098)
     1    163460    163853  0.0003780000
    10    163460    163464  0.0003789000
   100    163460    163460  0.0003789100
  1000    163460    163459  0.0003789120
 10000    163460    163459  0.0003789121

So if, as you say, using a span of 1 gets you very close, using a span of 100 or 1000 should be more than enough.

Overall, this is very similar in idea to linear "interpolation search". It's just a bit easier to implement because it makes only a single guess based on the interpolation, so it takes one line of code: int((target-start)*i/(times[i] - times[0]))
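To see the effect of the span concretely, that single-guess formula can be run over a synthetic timeline like the one above (the rounding to 7 decimal places here stands in for coarse recording and is my own choice, not taken from the answer):

```python
start, inc = 97523.29783, 0.000378912098
# timestamps recorded coarsely: rounded to 7 decimal places
times = [round(start + i * inc, 7) for i in range(1000000)]
target = 97585.23452

for span in (1, 100, 10000):
    dt = (times[span] - times[0]) / span   # timestep estimated across `span` points
    guess = int((target - times[0]) / dt)  # single interpolation guess
    print(span, guess)                     # the guess tightens as span grows
```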

tom10 answered Nov 11 '22 13:11



What you're describing is pretty much interpolation search. It works very much like binary search, but instead of choosing the middle element it assumes the distribution is close to uniform and guesses the approximate location.

The Wikipedia article contains a C++ implementation.
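For reference, here is a Python sketch of the same idea (my own paraphrase of the standard algorithm, extended to return the nearest index rather than requiring an exact match):

```python
def closest_index(arr, x):
    """Interpolation search over sorted `arr` for the index of the value
    closest to `x`. Probes where `x` *should* be under a uniform
    distribution instead of bisecting the window."""
    lo, hi = 0, len(arr) - 1
    while lo < hi and arr[lo] != arr[hi]:
        # guess the position by linear interpolation rather than halving
        pos = lo + int((x - arr[lo]) * (hi - lo) / (arr[hi] - arr[lo]))
        pos = max(lo, min(pos, hi))  # keep the probe inside the window
        if arr[pos] < x:
            lo = pos + 1
        elif arr[pos] > x:
            hi = pos - 1
        else:
            return pos
    # the window is down to (at most) one element; the nearest value is
    # that element or one of its immediate neighbours
    lo = max(0, min(lo, len(arr) - 1))
    nbhd = range(max(0, lo - 1), min(len(arr), lo + 2))
    return min(nbhd, key=lambda i: abs(arr[i] - x))
```

On near-uniform data like a constant-step timeline, this converges in very few probes, and it degrades gracefully (to roughly binary-search behavior plus extra probes) when the step is not quite constant.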

viraptor answered Nov 11 '22 14:11
