Given a table where the first column is seconds past a certain reference point and the second one is an arbitrary measurement:
6 0.738158581
21 0.801697222
39 1.797224596
49 2.77920469
54 2.839757536
79 3.832232283
91 4.676794376
97 5.18244704
100 5.521878863
118 6.316630137
131 6.778507504
147 7.020395216
157 7.331607129
176 7.637492223
202 7.848079136
223 7.989456499
251 8.76853608
278 9.092367123
...
As you can see, the measurements are sampled at irregular time points. I need to smooth the data by averaging each reading with the readings from up to 100 seconds prior to it (in Python). Since the data table is huge, an iterator-based method is strongly preferred. Unfortunately, after two hours of coding I can't figure out an efficient and elegant solution.
Can anyone help me?
EDITS:
I want one smoothed reading for each raw reading, and the smoothed reading is to be the arithmetic mean of the raw reading and any others in the previous 100 (delta) seconds. (John, you are right.) A naive reference version of this spec is sketched after these notes.
Huge means roughly 1e6 to 10e6 lines, and the code needs to work within tight RAM.
The data is approximately a random walk.
The data is sorted by time.
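To make the spec concrete, here is a deliberately naive, quadratic-time version of what I mean (far too slow for my data, but it defines the expected output; the name naive_smooth is just illustrative):

def naive_smooth(data, delta=100.0):
    # O(n^2) reference: average each reading with every earlier
    # reading that falls within the previous `delta` seconds.
    data = list(data)
    for i, (t, v) in enumerate(data):
        window = [pv for pt, pv in data[:i + 1] if pt > t - delta]
        yield t, sum(window) / len(window)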
I have tested the solutions proposed by J Machin and yairchu. They both gave the same results; however, on my data set, J Machin's version scales roughly quadratically, while yairchu's is linear. Following are execution times as measured by IPython's %timeit (in microseconds):
data size    J Machin    yairchu
       10        90.2       55.6
       50         930        258
      100        3080        514
      500       64700       2660
     1000      253000       5390
     2000      952000      11500
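The harness was along these lines (a reconstruction; make_walk and its parameters are illustrative, matching the random-walk shape of my data):

import random

def make_walk(n):
    # Irregularly sampled, approximately random-walk test data.
    t, v, out = 0.0, 0.0, []
    for _ in range(n):
        t += random.uniform(1, 30)
        v += random.gauss(0, 0.5)
        out.append((t, v))
    return out

# In IPython:
#   data = make_walk(1000)
#   %timeit list(getAvgValues(data, 100.0))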
Thank you all for the help.
I'm using a running sum to which I add the new members and subtract the old ones. However, done this way, one may accumulate floating-point inaccuracies.
Therefore I implement a "deque" with a list, and whenever my deque reallocates to a smaller size, I recalculate the sum on the same occasion.
I'm also calculating the average up to and including point x, so there's at least one sample point to average.
def getAvgValues(data, avgSampleTime):
    prevValsBuf = []   # sliding window of (t, v) pairs, list-backed "deque"
    prevValsStart = 0  # index of the oldest still-valid entry
    tot = 0            # running sum of the values in the window
    for t, v in data:
        avgStart = t - avgSampleTime
        # remove values that are too old
        while prevValsStart < len(prevValsBuf):
            pt, pv = prevValsBuf[prevValsStart]
            if pt > avgStart:
                break
            tot -= pv
            prevValsStart += 1
        # add the new item
        tot += v
        prevValsBuf.append((t, v))
        # yield the result
        numItems = len(prevValsBuf) - prevValsStart
        yield (t, tot / numItems)
        # compact prevValsBuf when more than half of it is dead weight
        if prevValsStart * 2 > len(prevValsBuf):
            prevValsBuf = prevValsBuf[prevValsStart:]
            prevValsStart = 0
            # recalculate tot so float precision errors don't accumulate
            tot = sum(v for (t, v) in prevValsBuf)
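For example (a minimal usage sketch with the first rows of the sample table):

the_data = [(6, 0.738158581), (21, 0.801697222), (39, 1.797224596)]
for t, avg in getAvgValues(the_data, 100):
    print(t, avg)
# 6 0.738158581
# 21 0.7699279015
# 39 1.112360133

Because it's a generator, it can consume rows straight from a file iterator without ever holding the whole table in RAM.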
You haven't said exactly when you want output. I'm assuming that you want one smoothed reading for each raw reading, and the smoothed reading is to be the arithmetic mean of the raw reading and any others in the previous 100 (delta) seconds.
Short answer: use a collections.deque ... it will never hold more than "delta" seconds of readings. The way I've set it up, you can treat the deque just like a list and easily calculate the mean, or some fancy gizmoid that gives more weight to recent readings.
Long answer:
>>> the_data = [tuple(map(float, x.split())) for x in """\
... 6 0.738158581
... 21 0.801697222
[snip]
... 251 8.76853608
... 278 9.092367123""".splitlines()]
>>> import collections
>>> delta = 100.0
>>> q = collections.deque()
>>> for t, v in the_data:
...     while q and q[0][0] <= t - delta:
...         # jettison outdated readings
...         _unused = q.popleft()
...     q.append((t, v))
...     count = len(q)
...     print(t, sum(item[1] for item in q) / count, count)
...
6.0 0.738158581 1
21.0 0.7699279015 2
39.0 1.112360133 3
49.0 1.52907127225 4
54.0 1.791208525 5
79.0 2.13137915133 6
91.0 2.49500989771 7
97.0 2.8309395405 8
100.0 3.12993279856 9
118.0 3.74976297144 9
131.0 4.41385300278 9
147.0 4.99420529389 9
157.0 5.8325615685 8
176.0 6.033109419 9
202.0 7.15545189083 6
223.0 7.4342562845 6
251.0 7.9150342134 5
278.0 8.4246097095 4
>>>
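If you'd rather have a reusable generator than an interactive loop, the same logic can be packaged like this (my sketch; the function name smoothed is illustrative):

import collections

def smoothed(data, delta=100.0):
    # Sliding-window mean: the deque never holds readings older than t - delta.
    q = collections.deque()
    for t, v in data:
        while q and q[0][0] <= t - delta:
            q.popleft()  # jettison outdated readings
        q.append((t, v))
        yield t, sum(item[1] for item in q) / len(q)

Note that summing the whole deque on every step is what made this approach superlinear in the timings above; yairchu's running sum avoids that repeated work.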
Edit
One-stop shop: get your fancy gizmoid here:
numerator = sum(item[1] * upsilon ** (t - item[0]) for item in q)
denominator = sum(upsilon ** (t - item[0]) for item in q)
gizmoid = numerator / denominator
where upsilon should be a little less than 1.0 (upsilon <= 0 is illegal; just above zero does little smoothing; exactly 1.0 gets you the arithmetic mean plus wasted CPU time; and greater than 1.0 gives the inverse of your purpose, weighting older readings more heavily).
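In context, the gizmoid just replaces the plain mean inside the loop above; a sketch (upsilon = 0.99 is an illustrative choice, and the_data is the table parsed earlier):

import collections

upsilon = 0.99  # illustrative; must satisfy 0 < upsilon <= 1
delta = 100.0
q = collections.deque()
for t, v in the_data:
    while q and q[0][0] <= t - delta:
        q.popleft()  # jettison outdated readings
    q.append((t, v))
    numerator = sum(item[1] * upsilon ** (t - item[0]) for item in q)
    denominator = sum(upsilon ** (t - item[0]) for item in q)
    gizmoid = numerator / denominator
    print(t, gizmoid)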