My software visualizes very large datasets, i.e. the data is so large that I can't hold all of it in RAM at any one time, so it has to be loaded in a paged fashion. I embed matplotlib functionality for displaying and manipulating the plot in the backend of my application.
These datasets contain three internal lists I use for visualization: time, height and dataset. My program plots the data as time x height, and additionally users have the option of drawing shapes around regions of the graph that can be extracted to a separate plot.
The difficult part is that when I want to extract the data inside a shape, the shape vertices are real coordinates computed by the plot, not rounded to the nearest point in my time array. Here's an example of a shape which bounds a region in my program. While X1 may represent the coordinate (2007-06-12 03:42:20.070901+00:00, 5.2345) according to matplotlib, the closest coordinate actually existing in time and height might be something like (2007-06-12 03:42:20.070801+00:00, 5.219), only slightly off from matplotlib's coordinate.
So given some arbitrary value, let's say x1 = 732839.154395 (a representation of the date in number format), and a list of similar values with a constant step:
732839.154392
732839.154392
732839.154393
732839.154393
732839.154394
732839.154394
732839.154395
732839.154396
732839.154396
732839.154397
732839.154397
732839.154398
732839.154398
732839.154399
etc...
What would be the most efficient way of finding the closest representation of that point? I could simply loop through the list and grab the value with the smallest difference, but the size of time is huge. Since I know the array is 1. sorted and 2. incremented with a constant step, I was thinking this problem should be solvable in O(1) time. Is there a known algorithm that solves this kind of problem, or would I need to devise a custom algorithm? Here is my current thought process (a code sketch follows the list):
grab the first and second elements of time
subtract the first element from the second, obtain the step
subtract the first element of time from the bounding x value, obtain the difference
divide the difference by the step, obtain an index
jump time forward to that index
check the elements surrounding that index to ensure the closest representation
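For reference, here is a minimal sketch of that idea in Python; the function and variable names are illustrative, and it assumes times is sorted with a (roughly) constant step and that times[1] - times[0] gives a usable, non-zero estimate of that step. If the first two recorded values happen to be equal (which can happen with coarsely recorded times), the step estimate breaks down; the answer below addresses this by estimating the step over a wider span.

def closest_index(times, target):
    # estimate the step from the first two elements
    step = times[1] - times[0]
    # jump straight to the estimated index
    guess = int(round((target - times[0]) / step))
    guess = max(0, min(guess, len(times) - 1))   # clamp into range
    # check the surrounding elements in case the estimate is off by one
    best = guess
    for i in (guess - 1, guess + 1):
        if 0 <= i < len(times) and abs(times[i] - target) < abs(times[best] - target):
            best = i
    return best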
The algorithm you suggest seems reasonable and should work.
As has become clear in your comments, the problem with it is the coarseness at which your time was recorded. (This can be common when unsynchronized data is recorded, i.e. the data generation clock, e.g. the frame rate, is not synced with the computer.)
The easy way around this is to read two points separated by a larger span, so for example, read the first time value and then the 1000th time value. Everything else stays the same in your calculation, but you get your timestep by subtracting the two and dividing by 1000.
Here's a test that makes data similar to yours:
start = 97523.29783
increment = .000378912098
target = 97585.23452

# build a timeline with coarsely recorded (truncated) time values
times = []
time = start
actual_index = None
for i in range(1000000):
    trunc = float(str(time)[:10])   # truncate the time value
    times.append(trunc)
    if actual_index is None and time > target:
        actual_index = i
    time = time + increment

# now test: estimate the timestep from spans of different sizes
intervals = [1, 2, 5, 10, 100, 1000, 10000]
for i in intervals:
    dt = (times[i] - times[0]) / i
    index = int((target - start) / dt)
    print(" %6i %8i %8i %.10f" % (i, actual_index, index, dt))
Result:
span actual guess est dt (actual=.000378912098)
1 163460 154841 0.0004000000
2 163460 176961 0.0003500000
5 163460 162991 0.0003800000
10 163460 162991 0.0003800000
100 163460 163421 0.0003790000
1000 163460 163464 0.0003789000
10000 163460 163460 0.0003789100
That is, as the space between the sampled points gets larger, the time-interval estimate gets more accurate (compare to increment in the program) and the estimated index (3rd column) gets closer to the actual index (2nd column). Note that the accuracy of the dt estimate is basically proportional to the number of digits in the span. The best you could do is use the times at the start and end points, but it seemed from your question statement that this would be difficult; if it's not, it will give the most accurate estimate of your time interval. Note that here, for clarity, I exaggerated the lack of accuracy by making my time-value recording very coarse, but in general, every power of 10 in your span increases your accuracy by the same amount.
As an example of that last point, if I reduce the coarseness of the time values by changing the truncation line to trunc = float(str(time)[:12]), I get:
span actual guess est dt (actual=.000378912098)
1 163460 163853 0.0003780000
10 163460 163464 0.0003789000
100 163460 163460 0.0003789100
1000 163460 163459 0.0003789120
10000 163460 163459 0.0003789121
So if, as you say, using a span of 1 gets you very close, using a span of 100 or 1000 should be more than enough.
Overall, this is very similar in idea to the linear "interpolation search". It's just a bit easier to implement because it only makes a single guess based on the interpolation, so it takes one line of code: int((target-start)*i/(times[i] - times[0]))
What you're describing is pretty much interpolation search. It works very much like binary search, but instead of choosing the middle element it assumes the distribution is close to uniform and guesses the approximate location.
The Wikipedia link contains a C++ implementation.
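Since the rest of this thread is in Python, here is a rough Python sketch of the same idea (the Wikipedia version is C++). The function name is mine; it assumes a sorted list of numbers and returns the index of the closest element rather than requiring an exact match.

def interpolation_search(arr, target):
    # find the index of the value in sorted list arr closest to target
    lo, hi = 0, len(arr) - 1
    while lo < hi and arr[lo] < target < arr[hi]:
        # guess a position assuming the values are roughly uniformly spaced
        mid = lo + int((target - arr[lo]) * (hi - lo) / (arr[hi] - arr[lo]))
        if arr[mid] < target:
            lo = mid + 1
        elif arr[mid] > target:
            hi = mid - 1
        else:
            return mid                      # exact hit
    # no exact hit: return whichever bracketing value is closest
    candidates = [i for i in (lo - 1, lo, hi, hi + 1) if 0 <= i < len(arr)]
    return min(candidates, key=lambda i: abs(arr[i] - target))

For data with a truly constant step this converges in one or two guesses, which is why the single-guess approach in the other answer works so well here.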