
Elegant numpy array shifting and NaN filling?

Tags: python, nan, numpy

I have a specific performance problem here. I'm working with meteorological forecast timeseries, which I compile into a numpy 2d array such that

  • dim0 = time at which forecast series starts
  • dim1 = the forecast horizon, e.g. 0 to 120 hrs

Now, I would like dim0 to have hourly intervals, but some sources yield forecasts only every N hours. As an example, say N=3 and the time step in dim1 is M=1 hour. Then I get something like

12:00  11.2  12.2  14.0  15.0  11.3  12.0
13:00  nan   nan   nan   nan   nan   nan
14:00  nan   nan   nan   nan   nan   nan
15:00  14.7  11.5  12.2  13.0  14.3  15.1

But of course there is information at 13:00 and 14:00 as well, since it can be filled in from the 12:00 forecast run. So I would like to end up with something like this:

12:00  11.2  12.2  14.0  15.0  11.3  12.0
13:00  12.2  14.0  15.0  11.3  12.0  nan
14:00  14.0  15.0  11.3  12.0  nan   nan
15:00  14.7  11.5  12.2  13.0  14.3  15.1

What is the fastest way to get there, assuming dim0 is in the order of 1e4 and dim1 in the order of 1e2? Right now I'm doing it row by row but that is very slow:

nRows, nCols = dat.shape
if N >= M:
    assert N % M == 0  # must have whole numbers
    for i in range(1, nRows):
        k = np.array(np.where(np.isnan(dat[i, :])))
        k = k[k < nCols - N]  # do not overstep
        dat[i, k] = dat[i-1, k+N]

I'm sure there must be a more elegant way to do this? Any hints would be greatly appreciated.

marfel asked Jul 26 '13 13:07



2 Answers

Behold, the power of boolean indexing!!!

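import numpy as np

def shift_nans(arr):
    while True:
        nan_mask = np.isnan(arr)
        # Cells that may be written: every row but the first, every column but the last
        write_mask = nan_mask[1:, :-1]
        # Cells to read from: one row up, one column to the right
        read_mask = nan_mask[:-1, 1:]
        # Only fill a nan when its source value is not itself nan
        write_mask &= ~read_mask
        if not np.any(write_mask):
            return arr
        arr[1:, :-1][write_mask] = arr[:-1, 1:][write_mask]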

I think the naming makes it self-explanatory. Getting the slicing right is a pain, but it seems to be working:

In [214]: shift_nans(test_data)
Out[214]: 
array([[ 11.2,  12.2,  14. ,  15. ,  11.3,  12. ],
       [ 12.2,  14. ,  15. ,  11.3,  12. ,   nan],
       [ 14. ,  15. ,  11.3,  12. ,   nan,   nan],
       [ 14.7,  11.5,  12.2,  13. ,  14.3,  15.1],
       [ 11.5,  12.2,  13. ,  14.3,  15.1,   nan],
       [ 15.7,  16.5,  17.2,  18. ,  14. ,  12. ]])
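
To see why those two slices line up, here is a minimal sketch on a small integer array (the names arr, targets and sources are only for illustration and are not part of the answer's code). Each cell of targets sits one row below and one column to the left of the matching cell in sources, which is exactly the "previous run, one step further along the horizon" relationship:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
# arr is:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

targets = arr[1:, :-1]   # every row but the first, every column but the last
sources = arr[:-1, 1:]   # every row but the last, every column but the first

# targets[i, j] is arr[i + 1, j]; sources[i, j] is arr[i, j + 1]
print(targets)  # rows [4 5 6] and [8 9 10]
print(sources)  # rows [1 2 3] and [5 6 7]
```

Both slices have the same shape, so one boolean mask can index the two of them consistently.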

And for timings:

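tmp = np.random.uniform(-10, 20, (10000, 100))  # shapes must be ints, not floats like 1e4
nan_idx = np.random.randint(30, 10000 - 1, 10000)
tmp[nan_idx] = np.nan  # blank out randomly chosen rows
tmp1 = tmp.copy()  # separate copy for the second function, since both work in place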

import timeit

t1 = timeit.timeit(stmt='shift_nans(tmp)',
                   setup='from __main__ import tmp, shift_nans',
                   number=1)
t2 = timeit.timeit(stmt='shift_time(tmp1)', # Ophion's code
                   setup='from __main__ import tmp1, shift_time',
                   number=1)

In [242]: t1, t2
Out[242]: (0.12696346416487359, 0.3427293070417363)
Jaime answered Nov 11 '22 04:11


Slice off the time column first, using a=yourdata[:,1:], then apply the following function.

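import numpy as np

def shift_time(dat):
    # Find number of required iterations: the longest run of empty rows
    check = np.where(~np.isnan(dat[:, 0]))[0]
    maxiters = np.max(np.diff(check)) - 1

    # No sense in iterations where it just updates nans
    cols = dat.shape[1]
    if cols < maxiters:
        maxiters = cols - 1

    for iters in range(maxiters):
        # Find nans (np.where returns row indices first, then column indices)
        row_loc, col_loc = np.where(np.isnan(dat[:, :-1]))
        # Fill each nan from the row above, one column to the right
        dat[(row_loc, col_loc)] = dat[(row_loc - 1, col_loc + 1)]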


a=np.array([[11.2,12.2,14.0,15.0,11.3,12.0],
[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
[14.7,11.5,12.2,13.0,14.3,15.]])

shift_time(a)
print(a)

[[ 11.2  12.2  14.   15.   11.3  12. ]
 [ 12.2  14.   15.   11.3  12.    nan]
 [ 14.   15.   11.3  12.    nan   nan]
 [ 14.7  11.5  12.2  13.   14.3  15. ]]

To use your data as is, pass in the sliced array. The function could be changed slightly to take the full array directly, but this seems to be the clearest way to show it:

shift_time(yourdata[:,1:]) #Updates in place, no need to return anything.
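
The in-place update works because basic slicing returns a view rather than a copy. A minimal sketch, assuming the time column is stored in the same float array as the forecasts (yourdata here is a made-up stand-in, not the asker's real data):

```python
import numpy as np

# Hypothetical stand-in: column 0 plays the role of the time column
yourdata = np.arange(12, dtype=float).reshape(3, 4)

view = yourdata[:, 1:]   # basic slicing: no data is copied
view[0, 0] = np.nan      # write through the view

print(np.shares_memory(view, yourdata))  # True
print(np.isnan(yourdata[0, 1]))          # True: the original array changed
```

Indexing with a boolean mask or an integer array would produce a copy instead, and changes made to that copy would not reach yourdata.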

Using tiago's test:

import time

tmp = np.random.uniform(-10, 20, (10000, 100))
nan_idx = np.random.randint(30, 10000 - 1, 10000)
tmp[nan_idx] = np.nan

t = time.time()
shift_time(tmp)  # shift_time takes only the array
print(time.time() - t)

0.364198923111 (seconds)

If you are really clever you should be able to get away with a single np.where.
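
One way that might look is sketched below. This is not the code from either answer. It assumes every row is either a complete forecast run or entirely nan, and it returns a new array instead of updating in place. The function name shift_fill_single_pass is made up for this sketch. It uses one np.where and one fancy-indexing gather, with no Python loop:

```python
import numpy as np

def shift_fill_single_pass(dat):
    """Fill all-nan rows from the most recent forecast run, without looping.

    Assumption: each row is either fully populated or fully nan.
    """
    n_rows, n_cols = dat.shape
    rows = np.arange(n_rows)
    has_run = ~np.isnan(dat[:, 0])

    # For every row, index of the most recent row that holds a forecast run
    src_row = np.maximum.accumulate(np.where(has_run, rows, 0))
    # Rows elapsed since that run = how far to shift along the horizon
    offset = rows - src_row

    # Column to read from for every output cell
    src_col = np.arange(n_cols)[None, :] + offset[:, None]
    valid = src_col < n_cols

    # Gather, clipping out-of-range columns, then blank those cells out
    out = dat[src_row[:, None], np.minimum(src_col, n_cols - 1)]
    out[~valid] = np.nan
    return out


a = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
              [np.nan] * 6,
              [np.nan] * 6,
              [14.7, 11.5, 12.2, 13.0, 14.3, 15.0]])
filled = shift_fill_single_pass(a)
print(filled)
```

Unlike the iterative versions, this sketch does not fill isolated nans inside an otherwise valid row, so it only fits data that matches the stated assumption.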

Daniel answered Nov 11 '22 04:11