I have a specific performance problem here. I'm working with meteorological forecast timeseries, which I compile into a numpy 2d array such that dim0 is the time at which each forecast run starts and dim1 is the forecast horizon (lead time).
Now, I would like dim0 to have hourly intervals, but some sources yield forecasts only every N hours. As an example, say N=3 and the time step in dim1 is M=1 hour. Then I get something like
12:00 11.2 12.2 14.0 15.0 11.3 12.0
13:00 nan nan nan nan nan nan
14:00 nan nan nan nan nan nan
15:00 14.7 11.5 12.2 13.0 14.3 15.1
But of course there is information at 13:00 and 14:00 as well, since it can be filled in from the 12:00 forecast run. So I would like to end up with something like this:
12:00 11.2 12.2 14.0 15.0 11.3 12.0
13:00 12.2 14.0 15.0 11.3 12.0 nan
14:00 14.0 15.0 11.3 12.0 nan nan
15:00 14.7 11.5 12.2 13.0 14.3 15.1
What is the fastest way to get there, assuming dim0 is on the order of 1e4 and dim1 on the order of 1e2? Right now I'm doing it row by row, but that is very slow:
nRows, nCols = dat.shape
if N >= M:
    assert(N % M == 0) # must have whole numbers
    for i in range(1, nRows):
        k = np.array(np.where(np.isnan(self.dat[i, :])))
        k = k[k < nCols - N] # do not overstep
        self.dat[i, k] = self.dat[i-1, k+N]
I'm sure there must be a more elegant way to do this? Any hints would be greatly appreciated.
Behold, the power of boolean indexing!!!
def shift_nans(arr):
    while True:
        nan_mask = np.isnan(arr)
        write_mask = nan_mask[1:, :-1]
        read_mask = nan_mask[:-1, 1:]
        write_mask &= ~read_mask
        if not np.any(write_mask):
            return arr
        arr[1:, :-1][write_mask] = arr[:-1, 1:][write_mask]
I think the naming is self-explanatory as to what is going on. Getting the slicing right is a pain, but it seems to be working:
In [214]: shift_nans(test_data)
Out[214]:
array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
[ 12.2, 14. , 15. , 11.3, 12. , nan],
[ 14. , 15. , 11.3, 12. , nan, nan],
[ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
[ 11.5, 12.2, 13. , 14.3, 15.1, nan],
[ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
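The input test_data is not shown in the answer. As a minimal sketch, assuming it held the three forecast runs visible in the output (rows 0, 3 and 5) with all-NaN rows in between, the result above can be reproduced like this. The way test_data is built here is a guess, not the answerer's actual setup:

```python
import numpy as np

def shift_nans(arr):
    # Same function as in the answer, repeated so this snippet runs on its own.
    while True:
        nan_mask = np.isnan(arr)
        write_mask = nan_mask[1:, :-1]   # cells that may be filled
        read_mask = nan_mask[:-1, 1:]    # cells one row up, one column right
        write_mask &= ~read_mask         # only fill where the source is not NaN
        if not np.any(write_mask):
            return arr
        arr[1:, :-1][write_mask] = arr[:-1, 1:][write_mask]

# Assumed input: rows 0, 3 and 5 are forecast runs, the other rows are missing.
test_data = np.full((6, 6), np.nan)
test_data[0] = [11.2, 12.2, 14.0, 15.0, 11.3, 12.0]
test_data[3] = [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]
test_data[5] = [15.7, 16.5, 17.2, 18.0, 14.0, 12.0]

result = shift_nans(test_data)  # fills test_data in place and returns it
print(result)
```

Each pass of the loop moves values one row down, so a gap of N - 1 missing rows takes N - 1 writing passes plus one final pass that finds nothing left to fill.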
And for timings:
tmp = np.random.uniform(-10, 20, (10**4, 10**2))
nan_idx = np.random.randint(30, 10**4 - 1, 10**4)
tmp[nan_idx] = np.nan
tmp1 = tmp.copy()
import timeit
t1 = timeit.timeit(stmt='shift_nans(tmp)',
                   setup='from __main__ import tmp, shift_nans',
                   number=1)
t2 = timeit.timeit(stmt='shift_time(tmp1)', # Ophion's code
                   setup='from __main__ import tmp1, shift_time',
                   number=1)
In [242]: t1, t2
Out[242]: (0.12696346416487359, 0.3427293070417363)
Slice your data using a = yourdata[:, 1:].
def shift_time(dat):
    # Find the number of required iterations
    check = np.where(~np.isnan(dat[:, 0]))[0]
    maxiters = np.max(np.diff(check)) - 1
    # No sense in iterations where it just updates nans
    cols = dat.shape[1]
    if cols < maxiters: maxiters = cols - 1
    for iters in range(maxiters):
        # Find nans (np.where returns row indices first, then column indices)
        row_loc, col_loc = np.where(np.isnan(dat[:, :-1]))
        dat[(row_loc, col_loc)] = dat[(row_loc - 1, col_loc + 1)]
a = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
              [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
              [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
              [14.7, 11.5, 12.2, 13.0, 14.3, 15.]])
shift_time(a)
print(a)
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ 14. 15. 11.3 12. nan nan]
[ 14.7 11.5 12.2 13. 14.3 15. ]]
To use it on your data as is (the function could be changed slightly to take your array directly, but this seems to be a clear way to show it):
shift_time(yourdata[:,1:]) #Updates in place, no need to return anything.
Using tiago's test:
import time
tmp = np.random.uniform(-10, 20, (10**4, 10**2))
nan_idx = np.random.randint(30, 10**4 - 1, 10**4)
tmp[nan_idx] = np.nan
t = time.time()
shift_time(tmp)
print(time.time() - t)
0.364198923111 (seconds)
If you are really clever you should be able to get away with a single np.where.
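One possible way to avoid the iteration altogether is sketched below. This is not code from the thread: the function name fill_from_last_run is made up for illustration, and the approach swaps the repeated np.where passes for a single fancy-indexing gather. It assumes every row is either a complete forecast run or entirely NaN (judged by column 0), and it returns a new array instead of updating in place.

```python
import numpy as np

def fill_from_last_run(dat):
    # Hypothetical helper: fill all-NaN rows from the most recent forecast run.
    n_rows, n_cols = dat.shape
    rows = np.arange(n_rows)

    # Assumption: a row is a forecast run if its first column is not NaN.
    is_run = ~np.isnan(dat[:, 0])

    # Index of the most recent run at or before each row (-1 if there is none).
    last_run = np.maximum.accumulate(np.where(is_run, rows, -1))

    # Hours elapsed since that run, i.e. how far to shift along dim1.
    lag = rows - last_run

    # Source column for every target cell, and whether it exists.
    src_cols = np.arange(n_cols)[None, :] + lag[:, None]
    ok = (src_cols < n_cols) & (last_run[:, None] >= 0)

    src_rows = np.broadcast_to(last_run[:, None], dat.shape)
    out = np.full(dat.shape, np.nan)
    out[ok] = dat[src_rows[ok], src_cols[ok]]
    return out

dat = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
                [np.nan] * 6,
                [np.nan] * 6,
                [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]])
filled = fill_from_last_run(dat)
print(filled)
```

If forecast runs themselves contain NaNs, this sketch copies them as they are rather than reaching back to an older run, so its result can differ from the iterative versions in that case.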