I am currently trying to process an experimental time series dataset that has missing values. I would like to calculate the sliding windowed mean of this dataset along time while handling NaN values. The correct way for me to do it is to compute, inside each window, the sum of the finite elements and divide it by their number. This nonlinearity forces me to use non-convolutional methods, so I have a severe time bottleneck in this part of the process. As a code example of what I am trying to accomplish, I present the following:
import numpy as np
#Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(n)
data[np.random.randint(0,n-1, n_miss)] = None
#Compute mean
result = np.zeros(data.size)
for count in range(data.size):
    # Centered window around the current index, clipped at the array edges
    part_data = data[max(count - (win_size - 1) // 2, 0):
                     min(count + (win_size + 1) // 2, data.size)]
    mask = np.isfinite(part_data)
    if np.sum(mask) != 0:
        result[count] = np.sum(part_data[mask]) / np.sum(mask)
    else:
        result[count] = None
print('Input:\t', data)
print('Output:\t', result)
with output:
Input: [ 0.47431791 0.17620835 0.78495647 0.79894688 0.58334064 0.38068788
0.87829696 nan 0.71589171 nan 0.70359557 0.76113969
0.13694387 0.32126573 0.22730891 nan 0.35057169 nan
0.89251851 0.56226354 0.040117 nan 0.37249799 0.77625334
nan nan nan nan 0.63227417 0.92781944
0.99416471 0.81850753 0.35004997 nan 0.80743783 0.60828597
nan 0.01410721 nan nan 0.6976317 nan
0.03875394 0.60924066 0.22998065 nan 0.34476729 0.38090961
nan 0.2021964 ]
Output: [ 0.32526313 0.47849424 0.5867039 0.72241466 0.58765847 0.61410849
0.62949242 0.79709433 0.71589171 0.70974364 0.73236763 0.53389305
0.40644977 0.22850617 0.27428732 0.2889403 0.35057169 0.6215451
0.72739103 0.49829968 0.30119027 0.20630749 0.57437567 0.57437567
0.77625334 nan nan 0.63227417 0.7800468 0.85141944
0.91349722 0.7209074 0.58427875 0.5787439 0.7078619 0.7078619
0.31119659 0.01410721 0.01410721 0.6976317 0.6976317 0.36819282
0.3239973 0.29265842 0.41961066 0.28737397 0.36283845 0.36283845
0.29155301 0.2021964 ]
Can this result be produced by numpy operations, without using a for loop?
You can do that using the rolling function of Pandas:
import numpy as np
import pandas as pd
#Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(n)
data[np.random.randint(0, n-1, n_miss)] = None
# min_periods=1: a window needs at least one non-NaN observation to produce
# a value; otherwise the result for that position is NaN
windowed_mean = pd.Series(data).rolling(window=win_size, min_periods=1).mean()
print(pd.DataFrame({'Data': data, 'Windowed mean': windowed_mean}))
Output:
Data Windowed mean
0 0.589376 0.589376
1 0.639173 0.614274
2 0.343534 0.524027
3 0.250329 0.411012
4 0.911952 0.501938
5 NaN 0.581141
6 0.224964 0.568458
7 NaN 0.224964
8 0.508419 0.366692
9 0.215418 0.361918
10 NaN 0.361918
11 0.638118 0.426768
12 0.587478 0.612798
13 0.097037 0.440878
14 0.688689 0.457735
15 0.858593 0.548107
16 0.408903 0.652062
17 0.448993 0.572163
18 NaN 0.428948
19 0.877453 0.663223
20 NaN 0.877453
21 NaN 0.877453
22 0.021798 0.021798
23 0.482054 0.251926
24 0.092387 0.198746
25 0.251766 0.275402
26 0.093854 0.146002
27 NaN 0.172810
28 NaN 0.093854
29 NaN NaN
30 0.965669 0.965669
31 0.695999 0.830834
32 NaN 0.830834
33 NaN 0.695999
34 NaN NaN
35 0.613727 0.613727
36 0.837533 0.725630
37 NaN 0.725630
38 0.782295 0.809914
39 NaN 0.782295
40 0.777429 0.779862
41 0.401355 0.589392
42 0.491709 0.556831
43 0.127813 0.340292
44 0.781625 0.467049
45 0.960466 0.623301
46 0.637618 0.793236
47 0.651264 0.749782
48 0.154911 0.481264
49 0.159145 0.321773
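Note that rolling() uses a trailing window by default, whereas the loop in the question centers each window on the current element. If you want the centered behaviour, you can pass center=True (a minimal sketch of my own, assuming the same data and win_size as above):

# Centered window, matching the question's indexing
centered_mean = pd.Series(data).rolling(window=win_size, min_periods=1,
                                        center=True).mean()
print(centered_mean.head())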
Here's a convolution-based approach using np.convolve -
mask = np.isnan(data)
K = np.ones(win_size, dtype=int)
# Windowed sum of the valid values divided by the windowed count of valid values
out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
Please note that this would have one extra element on either side.
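Alternatively (a sketch of my own, not part of the original answer), for an odd win_size you could ask np.convolve for mode='same' output so that no slicing is needed:

mask = np.isnan(data)
K = np.ones(win_size, dtype=int)
# mode='same' keeps the output aligned with and the same length as data,
# so the extra boundary elements never appear
out = np.convolve(np.where(mask, 0, data), K, mode='same') / \
      np.convolve(~mask, K, mode='same')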
If you are working with 2D data, we can use Scipy's 2D convolution.
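For example, a minimal 2D sketch along the same lines (my own illustration using scipy.signal.convolve2d; the array shape and names like data2d are just placeholders):

import numpy as np
from scipy.signal import convolve2d

win_size = 3
data2d = np.random.random((6, 6))
data2d[np.random.randint(0, 6, 8), np.random.randint(0, 6, 8)] = np.nan

mask2d = np.isnan(data2d)
K2 = np.ones((win_size, win_size), dtype=int)
# NaN-aware windowed mean: per-window sum of valid values / count of valid values
out2d = convolve2d(np.where(mask2d, 0, data2d), K2, mode='same') / \
        convolve2d((~mask2d).astype(int), K2, mode='same')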
Approaches -
def original_app(data, win_size):
    # Compute mean with a Python loop (the question's approach)
    result = np.zeros(data.size)
    for count in range(data.size):
        part_data = data[max(count - (win_size - 1) // 2, 0):
                         min(count + (win_size + 1) // 2, data.size)]
        mask = np.isfinite(part_data)
        if np.sum(mask) != 0:
            result[count] = np.sum(part_data[mask]) / np.sum(mask)
        else:
            result[count] = None
    return result
def numpy_app(data, win_size):
    mask = np.isnan(data)
    K = np.ones(win_size, dtype=int)
    out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
    return out[1:-1]  # Slice out the one-extra elems on sides
Sample run -
In [118]: #Construct sample data
...: n = 50
...: n_miss = 20
...: win_size = 3
...: data= np.random.random(50)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [119]: original_app(data, win_size = 3)
Out[119]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
In [120]: numpy_app(data, win_size = 3)
__main__:36: RuntimeWarning: invalid value encountered in divide
Out[120]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
Runtime test -
In [122]: #Construct sample data
...: n = 50000
...: n_miss = 20000
...: win_size = 3
...: data= np.random.random(n)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [123]: %timeit original_app(data, win_size = 3)
1 loops, best of 3: 1.51 s per loop
In [124]: %timeit numpy_app(data, win_size = 3)
1000 loops, best of 3: 1.09 ms per loop
In [125]: import pandas as pd
# @jdehesa's pandas solution
In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
100 loops, best of 3: 3.34 ms per loop
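As a quick sanity check (my addition, not part of the timings above), you can confirm that the loop version and the vectorized version agree while treating NaNs as equal:

# Compare the two implementations, counting NaN == NaN as a match
assert np.allclose(original_app(data, win_size=3),
                   numpy_app(data, win_size=3),
                   equal_nan=True)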